
This project aims to explore global causes of death and uncover trends across different regions and time periods using various data analysis and visualization tools. The goal is to provide a comprehensive understanding of mortality patterns that can inform public health strategies.
This dataset contains annual death counts by cause for 204 countries and territories from 1990 to 2019, a span of 30 years. It highlights the rise of lifestyle-related illnesses that have accompanied modern advancements. The recent pandemic has underscored how health crises can reshape the world, but many other illnesses continue to affect societies globally and to influence decision-makers. This notebook analyzes the global impact of these "new age" diseases using those 30 years of historical data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
# Set the plot style
sns.set(style="whitegrid")
df = pd.read_csv('cause_of_deaths.csv')
# Display the first few rows of the dataset
df.head()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Diabetes Mellitus | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 2159 | 1116 | 371 | 2087 | 93 | 1370 | 1538 | ... | 2108 | 3709 | 338 | 2054 | 4154 | 5945 | 2673 | 5005 | 323 | 2985 |
| 1 | Afghanistan | AFG | 1991 | 2218 | 1136 | 374 | 2153 | 189 | 1391 | 2001 | ... | 2120 | 3724 | 351 | 2119 | 4472 | 6050 | 2728 | 5120 | 332 | 3092 |
| 2 | Afghanistan | AFG | 1992 | 2475 | 1162 | 378 | 2441 | 239 | 1514 | 2299 | ... | 2153 | 3776 | 386 | 2404 | 5106 | 6223 | 2830 | 5335 | 360 | 3325 |
| 3 | Afghanistan | AFG | 1993 | 2812 | 1187 | 384 | 2837 | 108 | 1687 | 2589 | ... | 2195 | 3862 | 425 | 2797 | 5681 | 6445 | 2943 | 5568 | 396 | 3601 |
| 4 | Afghanistan | AFG | 1994 | 3027 | 1211 | 391 | 3081 | 211 | 1809 | 2849 | ... | 2231 | 3932 | 451 | 3038 | 6001 | 6664 | 3027 | 5739 | 420 | 3816 |
5 rows × 34 columns
df.columns
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis'],
dtype='object')
df.tail()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Diabetes Mellitus | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6115 | Zimbabwe | ZWE | 2015 | 1439 | 754 | 215 | 3019 | 2518 | 770 | 1302 | ... | 3176 | 2108 | 381 | 2990 | 2373 | 2751 | 1956 | 4202 | 632 | 146 |
| 6116 | Zimbabwe | ZWE | 2016 | 1457 | 767 | 219 | 3056 | 2050 | 801 | 1342 | ... | 3259 | 2160 | 393 | 3027 | 2436 | 2788 | 1962 | 4264 | 648 | 146 |
| 6117 | Zimbabwe | ZWE | 2017 | 1460 | 781 | 223 | 2990 | 2116 | 818 | 1363 | ... | 3313 | 2196 | 398 | 2962 | 2473 | 2818 | 2007 | 4342 | 654 | 144 |
| 6118 | Zimbabwe | ZWE | 2018 | 1450 | 795 | 227 | 2918 | 2088 | 825 | 1396 | ... | 3381 | 2240 | 400 | 2890 | 2509 | 2849 | 2030 | 4377 | 657 | 139 |
| 6119 | Zimbabwe | ZWE | 2019 | 1450 | 812 | 232 | 2884 | 2068 | 827 | 1434 | ... | 3460 | 2292 | 405 | 2855 | 2554 | 2891 | 2065 | 4437 | 662 | 136 |
5 rows × 34 columns
Let's explore the country column.
df["Country/Territory"].describe()
count 6120
unique 204
top Afghanistan
freq 30
Name: Country/Territory, dtype: object
df["Country/Territory"].value_counts()
Country/Territory
Afghanistan 30
Papua New Guinea 30
Niue 30
North Korea 30
North Macedonia 30
..
Greenland 30
Grenada 30
Guam 30
Guatemala 30
Zimbabwe 30
Name: count, Length: 204, dtype: int64
# Display descriptive statistics of the dataset
df.describe()
| | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | Maternal Disorders | HIV/AIDS | ... | Diabetes Mellitus | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | ... | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 |
| mean | 2004.50 | 1719.70 | 4864.19 | 1173.17 | 2253.60 | 4140.96 | 1683.33 | 2083.80 | 1262.59 | 5941.90 | ... | 5138.70 | 4724.13 | 425.01 | 1965.99 | 5930.80 | 17092.37 | 6124.07 | 10725.27 | 588.71 | 618.43 |
| std | 8.66 | 6672.01 | 18220.66 | 4616.16 | 10483.63 | 18427.75 | 8877.02 | 6917.01 | 6057.97 | 21011.96 | ... | 16773.08 | 16470.43 | 2022.64 | 8256.00 | 24097.78 | 105157.18 | 20688.12 | 37228.05 | 2128.60 | 4186.02 |
| min | 1990.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 1997.00 | 15.00 | 90.00 | 27.00 | 9.00 | 0.00 | 34.00 | 40.00 | 5.00 | 11.00 | ... | 236.00 | 145.75 | 6.00 | 5.00 | 174.75 | 289.00 | 154.00 | 284.00 | 17.00 | 2.00 |
| 50% | 2004.50 | 109.00 | 666.50 | 164.00 | 119.00 | 0.00 | 177.00 | 265.00 | 54.00 | 136.00 | ... | 1087.00 | 822.00 | 52.50 | 92.00 | 966.50 | 1689.00 | 1210.00 | 2185.00 | 126.00 | 15.00 |
| 75% | 2012.00 | 847.25 | 2456.25 | 609.25 | 1167.25 | 393.00 | 698.00 | 877.00 | 734.00 | 1879.00 | ... | 2954.00 | 2922.50 | 254.00 | 1042.50 | 3435.25 | 5249.75 | 3547.25 | 6080.00 | 450.00 | 160.00 |
| max | 2019.00 | 98358.00 | 320715.00 | 76990.00 | 268223.00 | 280604.00 | 153773.00 | 69640.00 | 107929.00 | 305491.00 | ... | 273089.00 | 222922.00 | 30883.00 | 202241.00 | 329237.00 | 1366039.00 | 270037.00 | 464914.00 | 25876.00 | 64305.00 |
8 rows × 32 columns
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Year | 6120.00 | 2004.50 | 8.66 | 1990.00 | 1997.00 | 2004.50 | 2012.00 | 2019.00 |
| Meningitis | 6120.00 | 1719.70 | 6672.01 | 0.00 | 15.00 | 109.00 | 847.25 | 98358.00 |
| Alzheimer's Disease and Other Dementias | 6120.00 | 4864.19 | 18220.66 | 0.00 | 90.00 | 666.50 | 2456.25 | 320715.00 |
| Parkinson's Disease | 6120.00 | 1173.17 | 4616.16 | 0.00 | 27.00 | 164.00 | 609.25 | 76990.00 |
| Nutritional Deficiencies | 6120.00 | 2253.60 | 10483.63 | 0.00 | 9.00 | 119.00 | 1167.25 | 268223.00 |
| Malaria | 6120.00 | 4140.96 | 18427.75 | 0.00 | 0.00 | 0.00 | 393.00 | 280604.00 |
| Drowning | 6120.00 | 1683.33 | 8877.02 | 0.00 | 34.00 | 177.00 | 698.00 | 153773.00 |
| Interpersonal Violence | 6120.00 | 2083.80 | 6917.01 | 0.00 | 40.00 | 265.00 | 877.00 | 69640.00 |
| Maternal Disorders | 6120.00 | 1262.59 | 6057.97 | 0.00 | 5.00 | 54.00 | 734.00 | 107929.00 |
| HIV/AIDS | 6120.00 | 5941.90 | 21011.96 | 0.00 | 11.00 | 136.00 | 1879.00 | 305491.00 |
| Drug Use Disorders | 6120.00 | 434.01 | 2898.76 | 0.00 | 3.00 | 20.00 | 129.00 | 65717.00 |
| Tuberculosis | 6120.00 | 7491.93 | 39549.98 | 0.00 | 35.00 | 417.00 | 2924.25 | 657515.00 |
| Cardiovascular Diseases | 6120.00 | 73160.45 | 291577.54 | 4.00 | 2028.00 | 11742.00 | 42546.50 | 4584273.00 |
| Lower Respiratory Infections | 6120.00 | 13687.91 | 48031.72 | 0.00 | 345.00 | 2126.50 | 10161.25 | 690913.00 |
| Neonatal Disorders | 6120.00 | 12558.94 | 56058.37 | 0.00 | 131.00 | 916.00 | 7419.75 | 852761.00 |
| Alcohol Use Disorders | 6120.00 | 787.42 | 3545.82 | 0.00 | 9.00 | 80.00 | 316.00 | 55200.00 |
| Self-harm | 6120.00 | 3874.83 | 18425.62 | 0.00 | 94.00 | 533.00 | 1882.25 | 220357.00 |
| Exposure to Forces of Nature | 6120.00 | 243.49 | 4717.10 | 0.00 | 0.00 | 0.00 | 12.00 | 222641.00 |
| Diarrheal Diseases | 6120.00 | 10822.80 | 65416.17 | 0.00 | 20.00 | 296.50 | 3946.75 | 1119477.00 |
| Environmental Heat and Cold Exposure | 6120.00 | 292.30 | 1704.47 | 0.00 | 2.00 | 21.00 | 109.00 | 29048.00 |
| Neoplasms | 6120.00 | 37542.24 | 161558.37 | 1.00 | 809.75 | 5629.50 | 20147.75 | 2716551.00 |
| Conflict and Terrorism | 6120.00 | 538.24 | 7033.31 | 0.00 | 0.00 | 0.00 | 23.00 | 503532.00 |
| Diabetes Mellitus | 6120.00 | 5138.70 | 16773.08 | 1.00 | 236.00 | 1087.00 | 2954.00 | 273089.00 |
| Chronic Kidney Disease | 6120.00 | 4724.13 | 16470.43 | 0.00 | 145.75 | 822.00 | 2922.50 | 222922.00 |
| Poisonings | 6120.00 | 425.01 | 2022.64 | 0.00 | 6.00 | 52.50 | 254.00 | 30883.00 |
| Protein-Energy Malnutrition | 6120.00 | 1965.99 | 8256.00 | 0.00 | 5.00 | 92.00 | 1042.50 | 202241.00 |
| Road Injuries | 6120.00 | 5930.80 | 24097.78 | 0.00 | 174.75 | 966.50 | 3435.25 | 329237.00 |
| Chronic Respiratory Diseases | 6120.00 | 17092.37 | 105157.18 | 1.00 | 289.00 | 1689.00 | 5249.75 | 1366039.00 |
| Cirrhosis and Other Chronic Liver Diseases | 6120.00 | 6124.07 | 20688.12 | 0.00 | 154.00 | 1210.00 | 3547.25 | 270037.00 |
| Digestive Diseases | 6120.00 | 10725.27 | 37228.05 | 0.00 | 284.00 | 2185.00 | 6080.00 | 464914.00 |
| Fire, Heat, and Hot Substances | 6120.00 | 588.71 | 2128.60 | 0.00 | 17.00 | 126.00 | 450.00 | 25876.00 |
| Acute Hepatitis | 6120.00 | 618.43 | 4186.02 | 0.00 | 2.00 | 15.00 | 160.00 | 64305.00 |
# Get the Statistical summary of the category columns
df.describe(include='object').T
| | count | unique | top | freq |
|---|---|---|---|---|
| Country/Territory | 6120 | 204 | Afghanistan | 30 |
| Code | 6120 | 204 | AFG | 30 |
# Display information about data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 34 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Country/Territory   6120 non-null   object
 1   Code   6120 non-null   object
 2   Year   6120 non-null   int64
 3   Meningitis   6120 non-null   int64
 4   Alzheimer's Disease and Other Dementias   6120 non-null   int64
 5   Parkinson's Disease   6120 non-null   int64
 6   Nutritional Deficiencies   6120 non-null   int64
 7   Malaria   6120 non-null   int64
 8   Drowning   6120 non-null   int64
 9   Interpersonal Violence   6120 non-null   int64
 10  Maternal Disorders   6120 non-null   int64
 11  HIV/AIDS   6120 non-null   int64
 12  Drug Use Disorders   6120 non-null   int64
 13  Tuberculosis   6120 non-null   int64
 14  Cardiovascular Diseases   6120 non-null   int64
 15  Lower Respiratory Infections   6120 non-null   int64
 16  Neonatal Disorders   6120 non-null   int64
 17  Alcohol Use Disorders   6120 non-null   int64
 18  Self-harm   6120 non-null   int64
 19  Exposure to Forces of Nature   6120 non-null   int64
 20  Diarrheal Diseases   6120 non-null   int64
 21  Environmental Heat and Cold Exposure   6120 non-null   int64
 22  Neoplasms   6120 non-null   int64
 23  Conflict and Terrorism   6120 non-null   int64
 24  Diabetes Mellitus   6120 non-null   int64
 25  Chronic Kidney Disease   6120 non-null   int64
 26  Poisonings   6120 non-null   int64
 27  Protein-Energy Malnutrition   6120 non-null   int64
 28  Road Injuries   6120 non-null   int64
 29  Chronic Respiratory Diseases   6120 non-null   int64
 30  Cirrhosis and Other Chronic Liver Diseases   6120 non-null   int64
 31  Digestive Diseases   6120 non-null   int64
 32  Fire, Heat, and Hot Substances   6120 non-null   int64
 33  Acute Hepatitis   6120 non-null   int64
dtypes: int64(32), object(2)
memory usage: 1.6+ MB
# Check for missing values
df.isnull().sum()
Country/Territory    0
Code    0
Year    0
Meningitis    0
Alzheimer's Disease and Other Dementias    0
Parkinson's Disease    0
Nutritional Deficiencies    0
Malaria    0
Drowning    0
Interpersonal Violence    0
Maternal Disorders    0
HIV/AIDS    0
Drug Use Disorders    0
Tuberculosis    0
Cardiovascular Diseases    0
Lower Respiratory Infections    0
Neonatal Disorders    0
Alcohol Use Disorders    0
Self-harm    0
Exposure to Forces of Nature    0
Diarrheal Diseases    0
Environmental Heat and Cold Exposure    0
Neoplasms    0
Conflict and Terrorism    0
Diabetes Mellitus    0
Chronic Kidney Disease    0
Poisonings    0
Protein-Energy Malnutrition    0
Road Injuries    0
Chronic Respiratory Diseases    0
Cirrhosis and Other Chronic Liver Diseases    0
Digestive Diseases    0
Fire, Heat, and Hot Substances    0
Acute Hepatitis    0
dtype: int64
# Check for duplicate rows
df.duplicated().sum()
0
# Display the data types of each column
df.dtypes
Country/Territory    object
Code    object
Year    int64
Meningitis    int64
Alzheimer's Disease and Other Dementias    int64
Parkinson's Disease    int64
Nutritional Deficiencies    int64
Malaria    int64
Drowning    int64
Interpersonal Violence    int64
Maternal Disorders    int64
HIV/AIDS    int64
Drug Use Disorders    int64
Tuberculosis    int64
Cardiovascular Diseases    int64
Lower Respiratory Infections    int64
Neonatal Disorders    int64
Alcohol Use Disorders    int64
Self-harm    int64
Exposure to Forces of Nature    int64
Diarrheal Diseases    int64
Environmental Heat and Cold Exposure    int64
Neoplasms    int64
Conflict and Terrorism    int64
Diabetes Mellitus    int64
Chronic Kidney Disease    int64
Poisonings    int64
Protein-Energy Malnutrition    int64
Road Injuries    int64
Chronic Respiratory Diseases    int64
Cirrhosis and Other Chronic Liver Diseases    int64
Digestive Diseases    int64
Fire, Heat, and Hot Substances    int64
Acute Hepatitis    int64
dtype: object
# Total number of records
len(df)
6120
df.shape
(6120, 34)
# Number of distinct years
df['Year'].nunique()
30
# Distinct year values
df['Year'].unique()
array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019], dtype=int64)
# Count the number of unique values in each column
df.nunique(axis=0)
Country/Territory    204
Code    204
Year    30
Meningitis    2020
Alzheimer's Disease and Other Dementias    3037
Parkinson's Disease    1817
Nutritional Deficiencies    2147
Malaria    1723
Drowning    1875
Interpersonal Violence    2142
Maternal Disorders    1818
HIV/AIDS    2412
Drug Use Disorders    876
Tuberculosis    2843
Cardiovascular Diseases    5225
Lower Respiratory Infections    4106
Neonatal Disorders    3553
Alcohol Use Disorders    1287
Self-harm    2758
Exposure to Forces of Nature    478
Diarrheal Diseases    2874
Environmental Heat and Cold Exposure    714
Neoplasms    4814
Conflict and Terrorism    918
Diabetes Mellitus    3366
Chronic Kidney Disease    3246
Poisonings    1087
Protein-Energy Malnutrition    2091
Road Injuries    3393
Chronic Respiratory Diseases    3803
Cirrhosis and Other Chronic Liver Diseases    3443
Digestive Diseases    4023
Fire, Heat, and Hot Substances    1406
Acute Hepatitis    1059
dtype: int64
# Correlation of various causes of death against year
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')
# Compute the correlation matrix
correlation_matrix = numeric_df.corr()
# Display correlation of all numeric columns with 'Year'
correlation_with_year = correlation_matrix['Year']
print(correlation_with_year)
Year    1.00
Meningitis    -0.04
Alzheimer's Disease and Other Dementias    0.08
Parkinson's Disease    0.07
Nutritional Deficiencies    -0.08
Malaria    -0.02
Drowning    -0.04
Interpersonal Violence    -0.00
Maternal Disorders    -0.03
HIV/AIDS    0.02
Drug Use Disorders    0.02
Tuberculosis    -0.03
Cardiovascular Diseases    0.03
Lower Respiratory Infections    -0.03
Neonatal Disorders    -0.03
Alcohol Use Disorders    0.01
Self-harm    -0.00
Exposure to Forces of Nature    -0.01
Diarrheal Diseases    -0.03
Environmental Heat and Cold Exposure    -0.02
Neoplasms    0.04
Conflict and Terrorism    -0.01
Diabetes Mellitus    0.07
Chronic Kidney Disease    0.07
Poisonings    -0.01
Protein-Energy Malnutrition    -0.09
Road Injuries    0.01
Chronic Respiratory Diseases    0.01
Cirrhosis and Other Chronic Liver Diseases    0.03
Digestive Diseases    0.03
Fire, Heat, and Hot Substances    -0.01
Acute Hepatitis    -0.03
Name: Year, dtype: float64
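The correlation coefficient ranges from -1 to 1:
1: perfect positive correlation (deaths from the cause rise as Year increases).
-1: perfect negative correlation (deaths from the cause fall as Year increases).
0: no linear correlation between Year and the cause of death.
Every value above is close to 0. This is likely because each row is a single country in a single year, so differences in scale between countries swamp any trend over time.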
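A minimal sketch of why row-level correlations with Year can sit near zero even when every country's deaths rise steadily. The three countries, their scales, and the 2% yearly growth are made-up toy values, not figures from the dataset; the point is only to contrast the row-level correlation with the correlation of yearly totals.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values): three countries of very different size,
# each with deaths growing linearly by 2% of the base value per year.
years = np.arange(1990, 2020)
scales = {"A": 100, "B": 10_000, "C": 1_000_000}
rows = [
    {"Country/Territory": country, "Year": year, "Deaths": scale * (1 + 0.02 * i)}
    for country, scale in scales.items()
    for i, year in enumerate(years)
]
toy = pd.DataFrame(rows)

# Row-level correlation: differences between countries dominate the variance.
row_level_corr = toy["Year"].corr(toy["Deaths"])

# Correlation after summing over countries: only the time trend remains.
yearly_total = toy.groupby("Year")["Deaths"].sum()
aggregated_corr = np.corrcoef(yearly_total.index, yearly_total.values)[0, 1]

print(f"row-level: {row_level_corr:.2f}, yearly totals: {aggregated_corr:.2f}")
```

Applied to the real data, the same idea would mean correlating `df.groupby('Year')[cause].sum()` with the year rather than correlating raw rows.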
# Total no.of Countries
df['Country/Territory'].nunique()
204
# Number of yearly records available for each country
df['Country/Territory'].value_counts()
Country/Territory
Afghanistan 30
Papua New Guinea 30
Niue 30
North Korea 30
North Macedonia 30
..
Greenland 30
Grenada 30
Guam 30
Guatemala 30
Zimbabwe 30
Name: count, Length: 204, dtype: int64
Thirty years of data are provided for each country.
df["Year"].value_counts()
Year
1990 204
1991 204
2018 204
2017 204
2016 204
2015 204
2014 204
2013 204
2012 204
2011 204
2010 204
2009 204
2008 204
2007 204
2006 204
2005 204
2004 204
2003 204
2002 204
2001 204
2000 204
1999 204
1998 204
1997 204
1996 204
1995 204
1994 204
1993 204
1992 204
2019 204
Name: count, dtype: int64
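The two `value_counts()` outputs suggest the data form a balanced panel (each country appears once per year). A small sketch of how that could be checked directly; `is_balanced_panel` is a hypothetical helper written for illustration, and the toy frame stands in for `df`.

```python
import pandas as pd

def is_balanced_panel(frame, entity_col="Country/Territory", time_col="Year"):
    """Return True when every entity has exactly one row for every time value."""
    n_periods = frame[time_col].nunique()
    per_entity = frame.groupby(entity_col)[time_col].agg(["count", "nunique"])
    complete = (per_entity["count"] == n_periods) & (per_entity["nunique"] == n_periods)
    return bool(complete.all())

# Toy stand-in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B", "B"],
    "Year": [1990, 1991, 1990, 1991],
})

balanced = is_balanced_panel(toy)             # both countries have both years
unbalanced = is_balanced_panel(toy.iloc[:-1])  # country B loses its 1991 row
print(balanced, unbalanced)
```

On the notebook's data, `is_balanced_panel(df)` would be expected to return `True` given the counts shown above (204 rows per year, 30 rows per country).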
Which country has the highest number of deaths due to cardiovascular diseases?
car_disease = df.groupby("Country/Territory")["Cardiovascular Diseases"].sum().sort_values(ascending=False).head(20)
car_disease
Country/Territory China 100505973 India 52994710 Russia 33903781 United States 26438346 Indonesia 13587011 Ukraine 13053052 Germany 10819770 Brazil 9589019 Japan 9210437 Pakistan 7745192 Italy 6614384 United Kingdom 6603062 Bangladesh 6123691 Egypt 5995471 Vietnam 5323920 Poland 5233134 France 4729313 Romania 4474916 Nigeria 4176488 Turkey 4167835 Name: Cardiovascular Diseases, dtype: int64
Global causes of death: distribution of the top causes of death.
# Summing the causes of death across all countries and years
global_causes = df.drop(columns=['Country/Territory', 'Code', 'Year']).sum().sort_values(ascending=False)
# Plot the top 10 causes of death globally
plt.figure(figsize=(12,6))
sns.barplot(x=global_causes.index[:10], y=global_causes.values[:10], palette='Blues_d')
plt.xticks(rotation=90)
plt.title('Top 10 Global Causes of Death')
plt.ylabel('Total Deaths')
plt.xlabel('Cause of Death')
plt.show()
Trend analysis over time: how a specific cause of death (e.g., Cardiovascular Diseases) has changed over the years.
# Group by year and sum up deaths for Cardiovascular Diseases
cardio_trend = df.groupby('Year')['Cardiovascular Diseases'].sum()
# Plotting the trend
plt.figure(figsize=(10,5))
sns.lineplot(x=cardio_trend.index, y=cardio_trend.values)
plt.title('Cardiovascular Diseases Trend Over Time')
plt.ylabel('Total Deaths')
plt.xlabel('Year')
plt.show()
Regional comparison: a bar chart showing how cardiovascular disease deaths differ across countries.
# Group by country and sum deaths for all causes (numeric_only skips the text 'Code' column)
regional_comparison = df.groupby('Country/Territory').sum(numeric_only=True).drop(columns=['Year'])
# Sort by Cardiovascular Diseases
regional_comparison = regional_comparison.sort_values(by='Cardiovascular Diseases', ascending=False)
# Plotting the top 10 countries for Cardiovascular Diseases
plt.figure(figsize=(12,6))
sns.barplot(x=regional_comparison.index[:10], y=regional_comparison['Cardiovascular Diseases'][:10], palette='Reds_d')
plt.xticks(rotation=90)
plt.title('Top 10 Countries with Cardiovascular Diseases Deaths')
plt.ylabel('Total Deaths')
plt.xlabel('Country/Territory')
plt.show()
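The chart above ranks countries on a single cause. To compare how causes differ between countries, one option is to find each country's leading cause and its share of that country's total. The sketch below uses a small toy frame with made-up numbers; the column names mirror the dataset, but the values do not come from it.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    "Country/Territory": ["A", "A", "B", "B"],
    "Year": [1990, 1991, 1990, 1991],
    "Cardiovascular Diseases": [100, 110, 20, 25],
    "Malaria": [5, 4, 60, 70],
    "Neoplasms": [50, 55, 10, 12],
})

# Every column except the identifiers is treated as a cause of death.
cause_cols = [c for c in toy.columns if c not in ("Country/Territory", "Year")]

# Total deaths per country and cause over all years.
totals = toy.groupby("Country/Territory")[cause_cols].sum()

# Leading cause per country and its share of that country's total deaths.
leading_cause = totals.idxmax(axis=1)
leading_share = (totals.max(axis=1) / totals.sum(axis=1) * 100).round(1)

print(pd.DataFrame({"leading_cause": leading_cause, "share_pct": leading_share}))
```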
plt.figure(figsize=(12, 6))
sns.barplot(x=car_disease, y=car_disease.index, palette='viridis', orient='h')
plt.title('Total Cardiovascular Disease Deaths by Country')
plt.xlabel('Total Cardiovascular Disease Deaths')
plt.ylabel('Country/Territory')
plt.xticks(ha='right')
plt.show()
for country in car_disease.index[:5]:
    selected_country_data = df[df['Country/Territory'] == country]
    sns.lineplot(x='Year', y='Cardiovascular Diseases', data=selected_country_data, label=country, marker='o', markersize=8)
plt.title('Disease Counts Over Years - Top 5 Countries')
plt.xlabel('Year')
plt.ylabel('Cardiovascular Diseases Count')
plt.legend(title='Country/Territory', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
disease_data = df.groupby("Year")["Cardiovascular Diseases"].sum().reset_index()
disease_data
| | Year | Cardiovascular Diseases |
|---|---|---|
| 0 | 1990 | 12062179 |
| 1 | 1991 | 12220282 |
| 2 | 1992 | 12437979 |
| 3 | 1993 | 12802108 |
| 4 | 1994 | 13026289 |
| 5 | 1995 | 13129252 |
| 6 | 1996 | 13213565 |
| 7 | 1997 | 13339902 |
| 8 | 1998 | 13461489 |
| 9 | 1999 | 13720763 |
| 10 | 2000 | 13957078 |
| 11 | 2001 | 14185571 |
| 12 | 2002 | 14501696 |
| 13 | 2003 | 14710723 |
| 14 | 2004 | 14745985 |
| 15 | 2005 | 14995528 |
| 16 | 2006 | 14991661 |
| 17 | 2007 | 15117363 |
| 18 | 2008 | 15402070 |
| 19 | 2009 | 15552545 |
| 20 | 2010 | 15838151 |
| 21 | 2011 | 16038263 |
| 22 | 2012 | 16245243 |
| 23 | 2013 | 16490053 |
| 24 | 2014 | 16715810 |
| 25 | 2015 | 17089707 |
| 26 | 2016 | 17398709 |
| 27 | 2017 | 17685890 |
| 28 | 2018 | 18113910 |
| 29 | 2019 | 18552218 |
sns.lineplot(x='Year', y='Cardiovascular Diseases', data=disease_data, marker='o', markersize=8);
What is the trend in the number of deaths caused by Alzheimer’s disease and other dementias over the years?
Dementias_trend = df.groupby("Year")["Alzheimer's Disease and Other Dementias"].sum().reset_index()
Dementias_trend
| | Year | Alzheimer's Disease and Other Dementias |
|---|---|---|
| 0 | 1990 | 560616 |
| 1 | 1991 | 583166 |
| 2 | 1992 | 605894 |
| 3 | 1993 | 629571 |
| 4 | 1994 | 652176 |
| 5 | 1995 | 674815 |
| 6 | 1996 | 696665 |
| 7 | 1997 | 717342 |
| 8 | 1998 | 738768 |
| 9 | 1999 | 761620 |
| 10 | 2000 | 786615 |
| 11 | 2001 | 814526 |
| 12 | 2002 | 845695 |
| 13 | 2003 | 877011 |
| 14 | 2004 | 909148 |
| 15 | 2005 | 945619 |
| 16 | 2006 | 982308 |
| 17 | 2007 | 1022057 |
| 18 | 2008 | 1065297 |
| 19 | 2009 | 1109405 |
| 20 | 2010 | 1155944 |
| 21 | 2011 | 1201138 |
| 22 | 2012 | 1247515 |
| 23 | 2013 | 1294701 |
| 24 | 2014 | 1343756 |
| 25 | 2015 | 1394942 |
| 26 | 2016 | 1451840 |
| 27 | 2017 | 1509646 |
| 28 | 2018 | 1568617 |
| 29 | 2019 | 1622426 |
sns.lineplot(x='Year', y="Alzheimer's Disease and Other Dementias", data=Dementias_trend, marker='o', markersize=8);
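To put a number on the two trends, the sketch below computes the overall percentage change and an average annual growth rate from the 1990 and 2019 totals shown in the tables above. The helper names `percent_change` and `annual_growth_rate` are illustrative, and the compound-growth formula is one common convention rather than something the notebook itself uses.

```python
def percent_change(first, last):
    """Overall change from first to last, as a percentage of first."""
    return (last - first) / first * 100

def annual_growth_rate(first, last, n_years):
    """Compound average growth per year, in percent."""
    return ((last / first) ** (1 / n_years) - 1) * 100

# Yearly global totals taken from the tables above.
cardio_1990, cardio_2019 = 12_062_179, 18_552_218
dementia_1990, dementia_2019 = 560_616, 1_622_426
n_years = 2019 - 1990

cardio_change = percent_change(cardio_1990, cardio_2019)
dementia_change = percent_change(dementia_1990, dementia_2019)
cardio_growth = annual_growth_rate(cardio_1990, cardio_2019, n_years)
dementia_growth = annual_growth_rate(dementia_1990, dementia_2019, n_years)

print(f"Cardiovascular: {cardio_change:.1f}% overall, {cardio_growth:.2f}% per year")
print(f"Dementias: {dementia_change:.1f}% overall, {dementia_growth:.2f}% per year")
```

These figures describe raw death counts, so they reflect population growth and ageing as well as any change in underlying risk.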
Which country has the highest number of deaths caused by malaria?
malaria_disease = df.groupby("Country/Territory")["Malaria"].sum().sort_values(ascending=False).head(20)
malaria_disease
Country/Territory Nigeria 6422063 Democratic Republic of Congo 2557219 India 2439244 Uganda 1265629 Burkina Faso 950762 Cote d'Ivoire 941597 Mozambique 817948 Tanzania 800490 Ghana 721339 Mali 711087 Niger 693962 Cameroon 614095 Ethiopia 453985 Malawi 404288 Sierra Leone 394491 Guinea 362660 Bangladesh 349375 Burundi 320767 Angola 317069 Benin 316834 Name: Malaria, dtype: int64
plt.figure(figsize=(12, 6))
sns.barplot(x=malaria_disease, y=malaria_disease.index, palette='viridis', orient='h')
plt.title('Total Malaria Deaths by Country')
plt.xlabel('Total Malaria Deaths')
plt.ylabel('Country/Territory')
plt.xticks(ha='right')
plt.show()
Among the 20 countries with the most malaria deaths, all are in Africa except India and Bangladesh, which is consistent with malaria's concentration in tropical regions.
df.groupby("Country/Territory").sum()
| Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | Maternal Disorders | ... | Diabetes Mellitus | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | AFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGAFGA... | 60135 | 78666 | 41998 | 13397 | 71453 | 13924 | 56536 | 108228 | 129621 | ... | 93207 | 134676 | 14530 | 70163 | 208331 | 209857 | 98419 | 186959 | 13559 | 98108 |
| Albania | ALBALBALBALBALBALBALBALBALBALBALBALBALBALBALBA... | 60135 | 1323 | 16549 | 4491 | 569 | 0 | 2397 | 5242 | 246 | ... | 4055 | 7636 | 500 | 526 | 8522 | 22632 | 8717 | 14907 | 636 | 44 |
| Algeria | DZADZADZADZADZADZADZADZADZADZADZADZADZADZADZAD... | 60135 | 15685 | 86914 | 22943 | 7138 | 70 | 24273 | 16702 | 29475 | ... | 89035 | 154666 | 12337 | 6407 | 369395 | 168453 | 91927 | 146527 | 27628 | 10492 |
| American Samoa | ASMASMASMASMASMASMASMASMASMASMASMASMASMASMASMA... | 60135 | 30 | 143 | 69 | 60 | 0 | 120 | 101 | 30 | ... | 970 | 512 | 0 | 60 | 164 | 612 | 181 | 341 | 0 | 0 |
| Andorra | ANDANDANDANDANDANDANDANDANDANDANDANDANDANDANDA... | 60135 | 0 | 614 | 137 | 0 | 0 | 0 | 15 | 0 | ... | 198 | 292 | 0 | 0 | 259 | 838 | 283 | 560 | 0 | 30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Venezuela | VENVENVENVENVENVENVENVENVENVENVENVENVENVENVENV... | 60135 | 11615 | 108735 | 18573 | 22554 | 3726 | 20273 | 266071 | 12739 | ... | 175790 | 161667 | 2607 | 21347 | 175036 | 122198 | 91720 | 168365 | 4949 | 1109 |
| Vietnam | VNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMVNMV... | 60135 | 38559 | 369363 | 83322 | 48613 | 17311 | 214356 | 47981 | 13167 | ... | 544222 | 396874 | 34681 | 7366 | 594980 | 911787 | 527192 | 735817 | 17380 | 30650 |
| Yemen | YEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMYEMY... | 60135 | 21095 | 31045 | 7188 | 68939 | 143463 | 27994 | 17918 | 53611 | ... | 30812 | 52119 | 12561 | 66731 | 278327 | 126525 | 64136 | 111536 | 23871 | 26532 |
| Zambia | ZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZMBZ... | 60135 | 98886 | 13473 | 4054 | 95913 | 205529 | 12809 | 30065 | 28395 | ... | 54098 | 41751 | 9056 | 92915 | 56976 | 59173 | 100581 | 147640 | 9476 | 8846 |
| Zimbabwe | ZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZWEZ... | 60135 | 41238 | 20017 | 5764 | 66723 | 118728 | 18169 | 32741 | 29802 | ... | 71175 | 49952 | 9113 | 65942 | 67207 | 71774 | 55027 | 108691 | 14718 | 3778 |
204 rows × 33 columns
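Nigeria's high malaria death toll is often attributed to weak health systems and poverty. Note that this dataset reports raw death counts rather than rates, so it cannot confirm the cause on its own, and a large population also raises the count.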
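Because the dataset holds death counts, comparing countries fairly would need a rate such as deaths per 100,000 people. The sketch below shows one way to compute it, assuming a separate population table is available. The `population` frame, its `Population` column, and all numbers here are hypothetical; the original dataset contains no population figures.

```python
import pandas as pd

# Hypothetical death counts for two countries in one year.
deaths = pd.DataFrame({
    "Country/Territory": ["X", "Y"],
    "Year": [2019, 2019],
    "Malaria": [50_000, 5_000],
})

# Hypothetical population table; it would have to come from an external source.
population = pd.DataFrame({
    "Country/Territory": ["X", "Y"],
    "Year": [2019, 2019],
    "Population": [200_000_000, 10_000_000],
})

# Join on country and year, then convert counts to deaths per 100,000 people.
merged = deaths.merge(
    population, on=["Country/Territory", "Year"], how="left", validate="one_to_one"
)
merged["Malaria per 100k"] = merged["Malaria"] / merged["Population"] * 100_000

ranked_by_count = merged.sort_values("Malaria", ascending=False)["Country/Territory"].tolist()
ranked_by_rate = merged.sort_values("Malaria per 100k", ascending=False)["Country/Territory"].tolist()

print(ranked_by_count, ranked_by_rate)
```

In this toy case the ranking flips: country X has more deaths, but country Y has the higher rate.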
What is the percentage of deaths caused by lower respiratory infections in the total number of deaths?
deaths_causes = df.iloc[:, 3:].sum().sort_values(ascending = False)
deaths_causes_per = (deaths_causes.div(deaths_causes.sum()) * 100).round(2)
deaths_causes_per
Cardiovascular Diseases 30.50 Neoplasms 15.65 Chronic Respiratory Diseases 7.13 Lower Respiratory Infections 5.71 Neonatal Disorders 5.24 Diarrheal Diseases 4.51 Digestive Diseases 4.47 Tuberculosis 3.12 Cirrhosis and Other Chronic Liver Diseases 2.55 HIV/AIDS 2.48 Road Injuries 2.47 Diabetes Mellitus 2.14 Alzheimer's Disease and Other Dementias 2.03 Chronic Kidney Disease 1.97 Malaria 1.73 Self-harm 1.62 Nutritional Deficiencies 0.94 Interpersonal Violence 0.87 Protein-Energy Malnutrition 0.82 Meningitis 0.72 Drowning 0.70 Maternal Disorders 0.53 Parkinson's Disease 0.49 Alcohol Use Disorders 0.33 Acute Hepatitis 0.26 Fire, Heat, and Hot Substances 0.25 Conflict and Terrorism 0.22 Drug Use Disorders 0.18 Poisonings 0.18 Environmental Heat and Cold Exposure 0.12 Exposure to Forces of Nature 0.10 dtype: float64
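Lower respiratory infections account for 5.71% of the deaths recorded in this dataset. Cardiovascular diseases, neoplasms (tumors), and chronic respiratory diseases together make up about 53% of the total, or about 59% once lower respiratory infections are included.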
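The combined share of the leading causes can be read off a running total. This short sketch accumulates the four largest percentages printed above; the values are copied from that output, and only the `cumsum` step is new.

```python
import pandas as pd

# Top four shares (percent of all recorded deaths), copied from the output above.
shares = pd.Series({
    "Cardiovascular Diseases": 30.50,
    "Neoplasms": 15.65,
    "Chronic Respiratory Diseases": 7.13,
    "Lower Respiratory Infections": 5.71,
})

# Running total: the share covered by the top 1, 2, 3, and 4 causes.
cumulative = shares.cumsum().round(2)
print(cumulative)
```

On the full series, `deaths_causes_per.cumsum()` would give the same running total for all 31 causes.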
plt.figure(figsize=(12, 10))
sns.barplot(x=deaths_causes, y=deaths_causes.index, palette='viridis', orient='h')
plt.title('Causes of Death')
plt.xlabel('Counts')
plt.xticks(ha='right')
plt.show()
cause_of_deaths = ['Meningitis',
'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
# Creating a new column for 'Total_no_of_Deaths' for individual Country and Year
df['Total_no_of_Deaths'] = df[cause_of_deaths].sum(axis=1)
# Top 10 Total_no_of_Deaths
top10_Total_no_of_Deaths = df.sort_values(by='Total_no_of_Deaths',ascending=False)[:10][['Total_no_of_Deaths','Country/Territory']]
top10_Total_no_of_Deaths
| | Total_no_of_Deaths | Country/Territory |
|---|---|---|
| 1139 | 10442561 | China |
| 1138 | 10163943 | China |
| 1137 | 9978653 | China |
| 1119 | 9814213 | China |
| 1118 | 9591222 | China |
| 1117 | 9503904 | China |
| 1116 | 9411928 | China |
| 1114 | 9366974 | China |
| 1115 | 9364587 | China |
| 1113 | 9284664 | China |
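Every row in this table is China, so the year is what distinguishes one row from the next. Below is a small sketch of how `DataFrame.nlargest` can produce the same ranking while keeping the `Year` column visible. It uses a made-up frame `demo` in place of `df`, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df with only the three columns that matter here
demo = pd.DataFrame({
    "Country/Territory": ["A", "A", "B", "B"],
    "Year": [1990, 1991, 1990, 1991],
    "Total_no_of_Deaths": [120, 150, 90, 300],
})

# Rank country-year rows by total deaths and keep the year in the result
top_rows = demo.nlargest(2, "Total_no_of_Deaths")[
    ["Country/Territory", "Year", "Total_no_of_Deaths"]
]
print(top_rows)
```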
# Display descriptive statistics again for reference
df.describe()
| | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | Maternal Disorders | HIV/AIDS | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | ... | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 |
| mean | 2004.50 | 1719.70 | 4864.19 | 1173.17 | 2253.60 | 4140.96 | 1683.33 | 2083.80 | 1262.59 | 5941.90 | ... | 4724.13 | 425.01 | 1965.99 | 5930.80 | 17092.37 | 6124.07 | 10725.27 | 588.71 | 618.43 | 239891.29 |
| std | 8.66 | 6672.01 | 18220.66 | 4616.16 | 10483.63 | 18427.75 | 8877.02 | 6917.01 | 6057.97 | 21011.96 | ... | 16470.43 | 2022.64 | 8256.00 | 24097.78 | 105157.18 | 20688.12 | 37228.05 | 2128.60 | 4186.02 | 873713.89 |
| min | 1990.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 |
| 25% | 1997.00 | 15.00 | 90.00 | 27.00 | 9.00 | 0.00 | 34.00 | 40.00 | 5.00 | 11.00 | ... | 145.75 | 6.00 | 5.00 | 174.75 | 289.00 | 154.00 | 284.00 | 17.00 | 2.00 | 6935.00 |
| 50% | 2004.50 | 109.00 | 666.50 | 164.00 | 119.00 | 0.00 | 177.00 | 265.00 | 54.00 | 136.00 | ... | 822.00 | 52.50 | 92.00 | 966.50 | 1689.00 | 1210.00 | 2185.00 | 126.00 | 15.00 | 50257.50 |
| 75% | 2012.00 | 847.25 | 2456.25 | 609.25 | 1167.25 | 393.00 | 698.00 | 877.00 | 734.00 | 1879.00 | ... | 2922.50 | 254.00 | 1042.50 | 3435.25 | 5249.75 | 3547.25 | 6080.00 | 450.00 | 160.00 | 158221.00 |
| max | 2019.00 | 98358.00 | 320715.00 | 76990.00 | 268223.00 | 280604.00 | 153773.00 | 69640.00 | 107929.00 | 305491.00 | ... | 222922.00 | 30883.00 | 202241.00 | 329237.00 | 1366039.00 | 270037.00 | 464914.00 | 25876.00 | 64305.00 | 10442561.00 |
8 rows × 33 columns
Data distribution: For most causes the mean is far above the median (for Meningitis, a mean of 1,719.70 against a median of 109). Together with the high standard deviations and wide ranges, this indicates that the data are strongly right-skewed and contain extreme outliers.
Variation: The standard deviation exceeds the mean for most causes of death, indicating substantial variability in the number of deaths recorded per country and year.
Range: The minimum and maximum values show how wide the spread is. For example, Meningitis deaths range from 0 to 98,358.
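The skew described above can be quantified rather than read off the table. This is a minimal sketch on invented numbers: the frame `demo` is hypothetical, and in the notebook `df[cause_of_deaths]` would take its place.

```python
import pandas as pd

# Hypothetical right-skewed counts: many small values and one very large one
demo = pd.DataFrame({"Meningitis": [0, 5, 15, 109, 847, 98358]})

summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),  # sample skewness; a positive value means a long right tail
})
print(summary)
```

A mean well above the median combined with a positive skewness is the pattern the descriptive statistics point to.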
# Plot the distribution of a specific cause of death
plt.figure(figsize=(10, 6))
sns.histplot(df['Meningitis'], bins=30, kde=True)
plt.title('Distribution of Meningitis Deaths')
plt.xlabel('Number of Deaths')
plt.ylabel('Frequency')
plt.show()
# Boxplot for various causes of death
plt.figure(figsize=(14, 8))
sns.boxplot(data=df[['Meningitis', "Alzheimer's Disease and Other Dementias", "Parkinson's Disease"]])
plt.title('Boxplot of Selected Causes of Death')
plt.xlabel('Cause of Death')
plt.ylabel('Number of Deaths')
plt.xticks(rotation=45)
plt.show()
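Because the counts span several orders of magnitude, the boxes in a plot like the one above tend to be compressed near zero. One common workaround is to plot `log1p` of the counts, that is `log(1 + x)`, which stays defined at zero. It is shown here as a sketch on made-up numbers rather than the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical counts for two causes
demo = pd.DataFrame({
    "Meningitis": [0, 15, 109, 847, 98358],
    "Parkinson's Disease": [0, 27, 164, 609, 76990],
})

# log1p(x) = log(1 + x): compresses the long right tail and maps 0 to 0
log_counts = np.log1p(demo)
print(log_counts.round(2))
# log_counts could then be passed to sns.boxplot(data=log_counts) as before
```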
# Find the total number of each disease
disease_df = df[cause_of_deaths].sum().to_frame().reset_index()
disease_df.rename(columns = {'index': 'Diseases', 0:'Total_deaths'}, inplace = True)
disease_df
| | Diseases | Total_deaths |
|---|---|---|
| 0 | Meningitis | 10524572 |
| 1 | Alzheimer's Disease and Other Dementias | 29768839 |
| 2 | Parkinson's Disease | 7179795 |
| 3 | Nutritional Deficiencies | 13792032 |
| 4 | Malaria | 25342676 |
| 5 | Drowning | 10301999 |
| 6 | Interpersonal Violence | 12752839 |
| 7 | Maternal Disorders | 7727046 |
| 8 | HIV/AIDS | 36364419 |
| 9 | Drug Use Disorders | 2656121 |
| 10 | Tuberculosis | 45850603 |
| 11 | Cardiovascular Diseases | 447741982 |
| 12 | Lower Respiratory Infections | 83770038 |
| 13 | Neonatal Disorders | 76860729 |
| 14 | Alcohol Use Disorders | 4819018 |
| 15 | Self-harm | 23713931 |
| 16 | Exposure to Forces of Nature | 1490132 |
| 17 | Diarrheal Diseases | 66235508 |
| 18 | Environmental Heat and Cold Exposure | 1788851 |
| 19 | Neoplasms | 229758538 |
| 20 | Conflict and Terrorism | 3294053 |
| 21 | Diabetes Mellitus | 31448872 |
| 22 | Chronic Kidney Disease | 28911692 |
| 23 | Poisonings | 2601082 |
| 24 | Protein-Energy Malnutrition | 12031885 |
| 25 | Road Injuries | 36296469 |
| 26 | Chronic Respiratory Diseases | 104605334 |
| 27 | Cirrhosis and Other Chronic Liver Diseases | 37479321 |
| 28 | Digestive Diseases | 65638635 |
| 29 | Fire, Heat, and Hot Substances | 3602914 |
| 30 | Acute Hepatitis | 3784791 |
# Create a Treemap
import plotly.express as px
fig = px.treemap(disease_df,
path = [px.Constant('Total_deaths'), 'Diseases'],
values = 'Total_deaths'
)
# Add some text for labels, title
fig.update_traces(textinfo='label+percent parent')
fig.update_layout(title_text='Percentage of cause of deaths around the world during 1990-2019', title_x=0.5, font_size=15)
fig.show()
df.columns
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
'Total_no_of_Deaths'],
dtype='object')
# Find the total number of deaths group by country
country_df = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=False).reset_index()
country_df
| | Country/Territory | Total_no_of_Deaths |
|---|---|---|
| 0 | China | 265408106 |
| 1 | India | 238158165 |
| 2 | United States | 71197802 |
| 3 | Russia | 59591155 |
| 4 | Indonesia | 44046941 |
| ... | ... | ... |
| 199 | Cook Islands | 3999 |
| 200 | Tuvalu | 2962 |
| 201 | Nauru | 2249 |
| 202 | Niue | 591 |
| 203 | Tokelau | 299 |
204 rows × 2 columns
# Find the Top 10 total number of deaths group by country.
Top10_countries = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=False).head(10).reset_index()
Top10_countries
| | Country/Territory | Total_no_of_Deaths |
|---|---|---|
| 0 | China | 265408106 |
| 1 | India | 238158165 |
| 2 | United States | 71197802 |
| 3 | Russia | 59591155 |
| 4 | Indonesia | 44046941 |
| 5 | Nigeria | 43670014 |
| 6 | Pakistan | 38151878 |
| 7 | Brazil | 32674112 |
| 8 | Japan | 31922807 |
| 9 | Germany | 25559667 |
# Create a bar chart of Top 10 countries with the highest number of deaths
plt.figure(figsize = (12,8))
sns.barplot(data = Top10_countries, x = 'Country/Territory', y = 'Total_no_of_Deaths', color = 'Blue')
# Add some text for labels, title
plt.xlabel('Country', fontsize = 12)
plt.ylabel('Total Number of Deaths', fontsize = 12)
plt.title('Top 10 countries with the highest number of deaths', fontsize =15)
plt.show()
# Find the Top 10 Countries with the LOWEST number of deaths
Low10_countries = df.groupby('Country/Territory')['Total_no_of_Deaths'].sum().sort_values(ascending=True).head(10).reset_index()
Low10_countries
| | Country/Territory | Total_no_of_Deaths |
|---|---|---|
| 0 | Tokelau | 299 |
| 1 | Niue | 591 |
| 2 | Nauru | 2249 |
| 3 | Tuvalu | 2962 |
| 4 | Cook Islands | 3999 |
| 5 | Palau | 4814 |
| 6 | San Marino | 6761 |
| 7 | Northern Mariana Islands | 7827 |
| 8 | American Samoa | 8619 |
| 9 | Marshall Islands | 10186 |
plt.figure(figsize=(12,8))
sns.barplot(data = Low10_countries, x = 'Country/Territory', y = 'Total_no_of_Deaths', color = 'Blue')
# Add some text for labels, title
plt.xticks(rotation = 90)
plt.xlabel('Country', fontsize = 12)
plt.ylabel('Total Number of Deaths', fontsize = 12)
plt.title('The 10 countries with the lowest number of deaths', fontsize =15)
plt.show()
# A Treemap for the Percentage of Total Number of Deaths group by country
fig = px.treemap(country_df,
path = [px.Constant('Total_no_of_Deaths'), 'Country/Territory'],
values = 'Total_no_of_Deaths'
)
# Add some text for labels, title
fig.update_traces(textinfo='label+percent parent')
fig.update_layout(title_text='Percentage of total number of deaths around the world', title_x=0.5, font_size=15)
fig.show()
# Group by country and sum the deaths for each cause
country_cause_deaths = df.groupby('Country/Territory')[cause_of_deaths].sum()
# Visualize the top causes of death for a specific country (e.g., 'China')
country_cause_deaths.loc['China'].plot(kind='bar', figsize=(12, 6), color='navy')
plt.title('Causes of Death in China (1990-2019)')
plt.ylabel('Total Number of Deaths')
plt.xticks(rotation=90)
plt.show()
# China - "Total_no_of_Deaths" against "Year"
China_Total_no_of_Deaths_df = df[df['Country/Territory']=='China'].sort_values(by='Total_no_of_Deaths',ascending=False)
# Scatter plot of China's total deaths per year
plt.figure(figsize=(8,4),dpi=200)
sns.scatterplot(data=China_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for China")
plt.show();
plt.figure(figsize=(15,8),dpi=200)
sns.barplot(data=China_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for China")
plt.show();
Note: The total number of deaths recorded in China rises over the period overall, though not in every year.
Common Causes of Death
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Malaria'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Malaria Deaths")
plt.title("Year Vs. Malaria Deaths in China")
plt.show();
There is a rapid drop in malaria deaths recorded in China after 1999.
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Nutritional Deficiencies'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Nutritional Deficiencies Deaths")
plt.title("Year Vs. Nutritional Deficiencies Deaths in China")
plt.show();
Deaths from nutritional deficiencies recorded in China drop in 2007, and from 2008 the count starts to rise again.
plt.figure(figsize=(12,8),dpi=200)
china_df = China_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
china_df['Interpersonal Violence'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Interpersonal Violence Deaths")
plt.title("Year Vs. Interpersonal Violence Deaths in China")
plt.show();
Deaths from interpersonal violence recorded in China fall continually over the period.
# India - "Total_no_of_Deaths" against "Year"
India_Total_no_of_Deaths_df = df[df['Country/Territory']=='India'].sort_values(by='Total_no_of_Deaths',ascending=False)
# Scatter plot of India's total deaths per year
plt.figure(figsize=(8,4),dpi=200)
sns.scatterplot(data=India_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for India")
plt.show();
plt.figure(figsize=(15,8),dpi=200)
sns.barplot(data=India_Total_no_of_Deaths_df, x='Year', y='Total_no_of_Deaths')
plt.xlabel("Year")
plt.ylabel("Total no.of Deaths")
plt.title("Year Vs. Total no.of Deaths for India")
plt.show();
Overall, the total number of deaths recorded in India rises over the period, although there are fluctuations in between.
Common Causes of Death
plt.figure(figsize=(12,8),dpi=200)
india_df = India_Total_no_of_Deaths_df.groupby(['Country/Territory','Year']).sum()
india_df['Malaria'].plot(kind='bar')
plt.xlabel("Year")
plt.ylabel("Malaria Deaths")
plt.title("Year Vs. Malaria Deaths in India")
plt.show();
Malaria deaths recorded in India drop rapidly from 1990, but the deaths in 2018 and 2019 are higher than those in 2016 and 2017.
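The country-by-cause charts above repeat the same filter, group, and plot steps. A small helper could remove that repetition. This is only a sketch: the function name `cause_trend` and the miniature frame `demo` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def cause_trend(data, country, cause):
    """Return yearly deaths from `cause` in `country`, indexed by year."""
    rows = data[data["Country/Territory"] == country]
    return rows.set_index("Year")[cause].sort_index()

# Hypothetical miniature of df with invented values
demo = pd.DataFrame({
    "Country/Territory": ["India", "India", "China", "China"],
    "Year": [1991, 1990, 1990, 1991],
    "Malaria": [90, 100, 10, 8],
})

india_malaria = cause_trend(demo, "India", "Malaria")
print(india_malaria)
# india_malaria.plot(kind="bar") would draw the same kind of chart as above
```

Returning a year-indexed Series keeps the x-axis labels as plain years instead of (country, year) pairs.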
# Total causes of death across 30 years
Countries_Total_no_of_Deaths_noyear_df = df.groupby('Country/Territory').sum()
Countries_Total_no_of_Deaths_noyear_df.drop('Year',axis=1,inplace=True)
# Top 3 Countries interms of "Total no.of Deaths" - For All the Years
Countries_Total_no_of_Deaths_noyear_df.sort_values(by='Total_no_of_Deaths',ascending =False)[:3]
| Country/Territory | Code | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | Maternal Disorders | HIV/AIDS | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| China | CHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNC... | 480899 | 5381846 | 1533092 | 584236 | 13418 | 2873619 | 776275 | 243257 | 433709 | ... | 4195276 | 770140 | 507664 | 8350399 | 36676826 | 4918899 | 8924906 | 383402 | 318564 | 265408106 |
| India | INDINDINDINDINDINDINDINDINDINDINDINDINDINDINDI... | 2008944 | 1707561 | 756832 | 3290569 | 2439244 | 2110438 | 1237163 | 2292449 | 2454374 | ... | 4556172 | 170119 | 2356222 | 5346154 | 25232974 | 6294910 | 11804380 | 730580 | 1672179 | 238158165 |
| United States | USAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAUSAU... | 40032 | 3302609 | 661288 | 133044 | 0 | 114752 | 596818 | 25206 | 528417 | ... | 2018497 | 40259 | 121030 | 1359744 | 4949052 | 1514325 | 3026943 | 126712 | 5851 | 71197802 |
3 rows × 33 columns
df.head()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 2159 | 1116 | 371 | 2087 | 93 | 1370 | 1538 | ... | 3709 | 338 | 2054 | 4154 | 5945 | 2673 | 5005 | 323 | 2985 | 147971 |
| 1 | Afghanistan | AFG | 1991 | 2218 | 1136 | 374 | 2153 | 189 | 1391 | 2001 | ... | 3724 | 351 | 2119 | 4472 | 6050 | 2728 | 5120 | 332 | 3092 | 156844 |
| 2 | Afghanistan | AFG | 1992 | 2475 | 1162 | 378 | 2441 | 239 | 1514 | 2299 | ... | 3776 | 386 | 2404 | 5106 | 6223 | 2830 | 5335 | 360 | 3325 | 169156 |
| 3 | Afghanistan | AFG | 1993 | 2812 | 1187 | 384 | 2837 | 108 | 1687 | 2589 | ... | 3862 | 425 | 2797 | 5681 | 6445 | 2943 | 5568 | 396 | 3601 | 182230 |
| 4 | Afghanistan | AFG | 1994 | 3027 | 1211 | 391 | 3081 | 211 | 1809 | 2849 | ... | 3932 | 451 | 3038 | 6001 | 6664 | 3027 | 5739 | 420 | 3816 | 194795 |
5 rows × 35 columns
Identify Top Causes of Death Globally:
Calculate the total number of deaths for each cause across all years and countries.
# Calculate total deaths for each cause
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year']).sum().sort_values(ascending=False)
# Display the top causes of death
print(total_deaths)
| Cause | Total deaths |
|---|---|
| Total_no_of_Deaths | 1468134716 |
| Cardiovascular Diseases | 447741982 |
| Neoplasms | 229758538 |
| Chronic Respiratory Diseases | 104605334 |
| Lower Respiratory Infections | 83770038 |
| Neonatal Disorders | 76860729 |
| Diarrheal Diseases | 66235508 |
| Digestive Diseases | 65638635 |
| Tuberculosis | 45850603 |
| Cirrhosis and Other Chronic Liver Diseases | 37479321 |
| HIV/AIDS | 36364419 |
| Road Injuries | 36296469 |
| Diabetes Mellitus | 31448872 |
| Alzheimer's Disease and Other Dementias | 29768839 |
| Chronic Kidney Disease | 28911692 |
| Malaria | 25342676 |
| Self-harm | 23713931 |
| Nutritional Deficiencies | 13792032 |
| Interpersonal Violence | 12752839 |
| Protein-Energy Malnutrition | 12031885 |
| Meningitis | 10524572 |
| Drowning | 10301999 |
| Maternal Disorders | 7727046 |
| Parkinson's Disease | 7179795 |
| Alcohol Use Disorders | 4819018 |
| Acute Hepatitis | 3784791 |
| Fire, Heat, and Hot Substances | 3602914 |
| Conflict and Terrorism | 3294053 |
| Drug Use Disorders | 2656121 |
| Poisonings | 2601082 |
| Environmental Heat and Cold Exposure | 1788851 |
| Exposure to Forces of Nature | 1490132 |

dtype: int64
china_10 = Countries_Total_no_of_Deaths_noyear_df.sort_values(by='Total_no_of_Deaths',ascending =False)[:1]
china_10.T
| Country/Territory | China |
|---|---|
| Code | CHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNCHNC... |
| Meningitis | 480899 |
| Alzheimer's Disease and Other Dementias | 5381846 |
| Parkinson's Disease | 1533092 |
| Nutritional Deficiencies | 584236 |
| Malaria | 13418 |
| Drowning | 2873619 |
| Interpersonal Violence | 776275 |
| Maternal Disorders | 243257 |
| HIV/AIDS | 433709 |
| Drug Use Disorders | 626914 |
| Tuberculosis | 2708461 |
| Cardiovascular Diseases | 100505973 |
| Lower Respiratory Infections | 8525819 |
| Neonatal Disorders | 4353666 |
| Alcohol Use Disorders | 485796 |
| Self-harm | 5078550 |
| Exposure to Forces of Nature | 138961 |
| Diarrheal Diseases | 886833 |
| Environmental Heat and Cold Exposure | 198582 |
| Neoplasms | 61060527 |
| Conflict and Terrorism | 3043 |
| Diabetes Mellitus | 3468554 |
| Chronic Kidney Disease | 4195276 |
| Poisonings | 770140 |
| Protein-Energy Malnutrition | 507664 |
| Road Injuries | 8350399 |
| Chronic Respiratory Diseases | 36676826 |
| Cirrhosis and Other Chronic Liver Diseases | 4918899 |
| Digestive Diseases | 8924906 |
| Fire, Heat, and Hot Substances | 383402 |
| Acute Hepatitis | 318564 |
| Total_no_of_Deaths | 265408106 |
# Access the first row (China, the country with the highest total)
china_data = china_10.iloc[0]
# Drop the overall total so that only individual causes are ranked
china_data = china_data.drop('Total_no_of_Deaths')
# Convert all values to numeric, coercing the non-numeric 'Code' entry to NaN
china_data = pd.to_numeric(china_data, errors='coerce')
# Drop the NaN left by the 'Code' entry
china_data = china_data.dropna()
# Sort the values
sorted_china_data = china_data.sort_values(ascending=False)
# Get top 10 causes
top_10_china = sorted_china_data.head(10)
# Print top 10 causes
print(top_10_china)
| Cause | Total deaths |
|---|---|
| Cardiovascular Diseases | 100505973.00 |
| Neoplasms | 61060527.00 |
| Chronic Respiratory Diseases | 36676826.00 |
| Digestive Diseases | 8924906.00 |
| Lower Respiratory Infections | 8525819.00 |
| Road Injuries | 8350399.00 |
| Alzheimer's Disease and Other Dementias | 5381846.00 |
| Self-harm | 5078550.00 |
| Cirrhosis and Other Chronic Liver Diseases | 4918899.00 |
| Neonatal Disorders | 4353666.00 |

Name: China, dtype: float64
# Plot global deaths per year for every cause, one line per cause
# Exclude identifier columns and derived totals so that only cause columns are reshaped
columns_to_exclude = ['Country/Territory', 'Code', 'Year', 'Rolling_Avg_Total_Deaths', 'Total_no_of_Deaths']
columns_to_plot = [col for col in df.columns if col not in columns_to_exclude]
# Sum each cause over all countries for every year
yearly_totals = df.groupby('Year')[columns_to_plot].sum().reset_index()
# Reshape the DataFrame from wide to long format
df_long = yearly_totals.melt(id_vars=['Year'], value_vars=columns_to_plot, var_name='Cause', value_name='Deaths')
# Ensure Deaths is numeric
df_long['Deaths'] = pd.to_numeric(df_long['Deaths'], errors='coerce')
# Plot with seaborn; Year stays numeric so the x-axis is ordered correctly
plt.figure(figsize=(12, 8))
sns.lineplot(data=df_long, x='Year', y='Deaths', hue='Cause')
plt.title('Global Deaths per Year by Cause')
plt.xlabel('Year')
plt.ylabel('Deaths')
plt.legend(title='Cause', bbox_to_anchor=(1.02, 1), loc='upper left', fontsize='small')
plt.show()
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 4), dpi=200)
top_10_china.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in China")
plt.show()
# Access the data for India
India_10 = Countries_Total_no_of_Deaths_noyear_df.loc[Countries_Total_no_of_Deaths_noyear_df.index == 'India']
# Drop the 'Total_no_of_Deaths' column and convert the data to numeric
india_data = pd.to_numeric(India_10.iloc[0].drop('Total_no_of_Deaths'), errors='coerce')
# Drop NaN values if they exist
india_data = india_data.dropna()
# Sort the values
sorted_india_data = india_data.sort_values(ascending=False)
# Get top 10 causes
top_10_india = sorted_india_data.head(10)
# Print top 10 causes
print(top_10_india)
| Cause | Total deaths |
|---|---|
| Cardiovascular Diseases | 52994710.00 |
| Diarrheal Diseases | 26243547.00 |
| Chronic Respiratory Diseases | 25232974.00 |
| Neonatal Disorders | 20911570.00 |
| Neoplasms | 17762703.00 |
| Lower Respiratory Infections | 16419404.00 |
| Tuberculosis | 15820922.00 |
| Digestive Diseases | 11804380.00 |
| Cirrhosis and Other Chronic Liver Diseases | 6294910.00 |
| Self-harm | 5543395.00 |

Name: India, dtype: float64
# Plot
plt.figure(figsize=(8, 4), dpi=200)
top_10_india.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in India")
plt.show()
# Access the data for the United States
usa_data = Countries_Total_no_of_Deaths_noyear_df.loc['United States']
# Check if 'USA' is in the index
print(Countries_Total_no_of_Deaths_noyear_df.index)
Index(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
...
'United States', 'United States Virgin Islands', 'Uruguay',
'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia',
'Zimbabwe'],
dtype='object', name='Country/Territory', length=204)
# Drop the 'Total_no_of_Deaths' column if present and convert to numeric
if 'Total_no_of_Deaths' in usa_data.index:
    usa_data = usa_data.drop('Total_no_of_Deaths')
usa_data = pd.to_numeric(usa_data, errors='coerce')
# Drop NaNs if they exist
usa_data = usa_data.dropna()
# Sort the values
sorted_usa_data = usa_data.sort_values(ascending=False)
# Get top 10 causes
top_10_usa = sorted_usa_data.head(10)
# Print top 10 causes
print(top_10_usa)
| Cause | Total deaths |
|---|---|
| Cardiovascular Diseases | 26438346.00 |
| Neoplasms | 18905315.00 |
| Chronic Respiratory Diseases | 4949052.00 |
| Alzheimer's Disease and Other Dementias | 3302609.00 |
| Digestive Diseases | 3026943.00 |
| Lower Respiratory Infections | 2248625.00 |
| Diabetes Mellitus | 2030631.00 |
| Chronic Kidney Disease | 2018497.00 |
| Cirrhosis and Other Chronic Liver Diseases | 1514325.00 |
| Road Injuries | 1359744.00 |

Name: United States, dtype: float64
# Plot
plt.figure(figsize=(8, 4), dpi=200)
top_10_usa.plot(kind='barh')
plt.xlabel("Total no. of Deaths")
plt.ylabel("Causes of Deaths")
plt.title("Top 10 Causes of Deaths in the United States")
plt.show()
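The China, India, and United States blocks follow the same steps: select one country's row, keep only the cause columns, and take the ten largest values. Those steps can be sketched as one helper. The function `top_causes_for` and the frame `demo_totals` below are hypothetical names with invented values; in the notebook the grouped per-country frame and the `cause_of_deaths` list would be passed instead.

```python
import pandas as pd

def top_causes_for(totals, country, causes, n=10):
    """Return the n largest cause totals for one country."""
    # Selecting only the cause columns leaves out 'Code' and the overall total
    return totals.loc[country, causes].astype(float).nlargest(n)

# Hypothetical per-country totals, shaped like the grouped frame above
demo_totals = pd.DataFrame(
    {
        "Code": ["AAAAAA", "BBBBBB"],
        "Malaria": [5, 50],
        "Neoplasms": [70, 20],
        "Drowning": [30, 10],
        "Total_no_of_Deaths": [105, 80],
    },
    index=pd.Index(["A", "B"], name="Country/Territory"),
)
causes = ["Malaria", "Neoplasms", "Drowning"]

top_two = top_causes_for(demo_totals, "A", causes, n=2)
print(top_two)
```

Passing the cause list explicitly means no numeric coercion or NaN handling is needed for the string column.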
# Group data by year and calculate the total number of deaths per year
deaths_per_year = df.groupby('Year')['Total_no_of_Deaths'].sum()
# Plot the trend of deaths over time
plt.figure(figsize=(12, 6))
plt.plot(deaths_per_year, marker='o')
plt.title('Total Deaths Over Time (1990-2019)')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.grid(True)
plt.show()
import matplotlib.pyplot as plt
import seaborn as sns
# Prepare data by grouping and summing
trend_df = df.groupby(['Year']).sum()
# Plot trend for a specific cause of death
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x='Year', y='Cardiovascular Diseases', marker='o')
plt.xlabel("Year")
plt.ylabel("Cardiovascular Diseases Deaths")
plt.title("Trend of Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()
Highlight Increases or Decreases
Calculate the percentage change between consecutive years:
# Calculate percentage change
trend_df['Cardiovascular Diseases Change'] = trend_df['Cardiovascular Diseases'].pct_change() * 100
# Plot percentage change
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x=trend_df.index, y='Cardiovascular Diseases Change', marker='o')
plt.xlabel("Year")
plt.ylabel("Percentage Change in Cardiovascular Diseases Deaths")
plt.title("Percentage Change in Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()
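For reference, `pct_change` compares each value with the one before it, `(x[t] - x[t-1]) / x[t-1]`, so the first year has no predecessor and comes out as NaN. A tiny worked example on made-up yearly totals follows; the Series `yearly` is illustrative only.

```python
import pandas as pd

# Hypothetical yearly totals
yearly = pd.Series([100.0, 110.0, 99.0], index=[1990, 1991, 1992])

# Fractional change from the previous year, scaled to percent
change = yearly.pct_change() * 100
print(change.round(2))
```

Here 1991 is 10% above 1990, and 1992 is 10% below 1991.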
Identify Top Causes of Death Globally and Regionally
# Global top causes of death
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year']).sum().sort_values(ascending=False)
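# Exclude the overall total so that only individual causes are ranked
top_global_causes = total_deaths.drop('Total_no_of_Deaths', errors='ignore').head(10)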
plt.figure(figsize=(12, 8), dpi=200)
top_global_causes.plot(kind='barh')
plt.xlabel("Total Number of Deaths")
plt.ylabel("Causes of Death")
plt.title("Top 10 Causes of Deaths Globally")
plt.show()
Advanced Visualizations
# Trend visualization with Seaborn
plt.figure(figsize=(12, 8), dpi=200)
sns.lineplot(data=trend_df, x=trend_df.index, y='Cardiovascular Diseases', marker='o', color='b')
plt.xlabel("Year")
plt.ylabel("Cardiovascular Diseases Deaths")
plt.title("Trend of Cardiovascular Diseases Deaths Over Time")
plt.grid(True)
plt.show()
# Heatmap for global causes of death
plt.figure(figsize=(14, 10), dpi=200)
sns.heatmap(df.drop(columns=['Country/Territory', 'Code', 'Year']).corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Causes of Death")
plt.show()
# Trend Analysis:
# Analyze trends over time to identify increases or decreases.
import matplotlib.pyplot as plt
# Plot trends for a specific cause (e.g., Cardiovascular Diseases)
df.groupby('Year')['Cardiovascular Diseases'].sum().plot()
plt.title('Trends in Cardiovascular Diseases')
plt.xlabel('Year')
plt.ylabel('Number of Deaths')
plt.show()
Regional Analysis:
Compare the top causes of death across different countries.
# Ensure all relevant columns are numeric
for col in df.columns[3:]: # Adjust based on your actual columns
    df[col] = pd.to_numeric(df[col], errors='coerce')
# Keep only numeric columns for aggregation
df_numeric = df.select_dtypes(include=['number'])
# Calculate the average number of deaths by country
average_deaths_by_country = df_numeric.groupby(df['Country/Territory']).mean()
# Display the top countries for a specific cause, e.g., 'Cardiovascular Diseases'
top_countries = average_deaths_by_country['Cardiovascular Diseases'].sort_values(ascending=False)
print(top_countries.head(10))
| Country/Territory | Cardiovascular Diseases |
|---|---|
| China | 3350199.10 |
| India | 1766490.33 |
| Russia | 1130126.03 |
| United States | 881278.20 |
| Indonesia | 452900.37 |
| Ukraine | 435101.73 |
| Germany | 360659.00 |
| Brazil | 319633.97 |
| Japan | 307014.57 |
| Pakistan | 258173.07 |

Name: Cardiovascular Diseases, dtype: float64
# improve readability
pd.options.display.float_format = '{:.2f}'.format
print(top_countries.head(10))
| Country/Territory | Cardiovascular Diseases |
|---|---|
| China | 3350199.10 |
| India | 1766490.33 |
| Russia | 1130126.03 |
| United States | 881278.20 |
| Indonesia | 452900.37 |
| Ukraine | 435101.73 |
| Germany | 360659.00 |
| Brazil | 319633.97 |
| Japan | 307014.57 |
| Pakistan | 258173.07 |

Name: Cardiovascular Diseases, dtype: float64
df.head()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 2159 | 1116 | 371 | 2087 | 93 | 1370 | 1538 | ... | 3709 | 338 | 2054 | 4154 | 5945 | 2673 | 5005 | 323 | 2985 | 147971 |
| 1 | Afghanistan | AFG | 1991 | 2218 | 1136 | 374 | 2153 | 189 | 1391 | 2001 | ... | 3724 | 351 | 2119 | 4472 | 6050 | 2728 | 5120 | 332 | 3092 | 156844 |
| 2 | Afghanistan | AFG | 1992 | 2475 | 1162 | 378 | 2441 | 239 | 1514 | 2299 | ... | 3776 | 386 | 2404 | 5106 | 6223 | 2830 | 5335 | 360 | 3325 | 169156 |
| 3 | Afghanistan | AFG | 1993 | 2812 | 1187 | 384 | 2837 | 108 | 1687 | 2589 | ... | 3862 | 425 | 2797 | 5681 | 6445 | 2943 | 5568 | 396 | 3601 | 182230 |
| 4 | Afghanistan | AFG | 1994 | 3027 | 1211 | 391 | 3081 | 211 | 1809 | 2849 | ... | 3932 | 451 | 3038 | 6001 | 6664 | 3027 | 5739 | 420 | 3816 | 194795 |
5 rows × 35 columns
Sample Data for Regional Comparisons
# Sample data: Top 10 countries with the highest number of deaths from Cardiovascular Diseases
top_countries = df.groupby('Country/Territory')['Cardiovascular Diseases'].sum().nlargest(10)
# Set figure size
plt.figure(figsize=(12, 8))
# Create a bar plot
sns.barplot(x=top_countries.index, y=top_countries.values, palette='viridis')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Add titles and labels
plt.title('Top 10 Countries/Territories by Cardiovascular Diseases', fontsize=16)
plt.xlabel('Country/Territory', fontsize=14)
plt.ylabel('Number of Deaths', fontsize=14)
# Improve layout and avoid clipping
plt.tight_layout()
# Show plot
plt.show()
Statistical Analysis
Correlation Analysis:
# Select only numerical columns for correlation analysis
numerical_cols = df.select_dtypes(include=['number']).columns
df_numerical = df[numerical_cols]
# Calculate correlation matrix
correlation_matrix = df_numerical.corr()
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numerical Features')
plt.show()
df.columns
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
'Total_no_of_Deaths'],
dtype='object')
# Define the columns to include in the sample correlation matrix
sample_columns = [
'Meningitis',
'Nutritional Deficiencies',
'Alzheimer\'s Disease and Other Dementias',
'Parkinson\'s Disease',
'Malaria',
'Drowning',
'Interpersonal Violence'
]
# Filter the DataFrame to only include these columns
df_sample = df[sample_columns]
# Calculate the correlation matrix for the sample
correlation_matrix_sample = df_sample.corr()
# Plot the heatmap of the sample correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_sample, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Sample Correlation Matrix of Selected Features')
plt.show()
# Selected causes of death: compare total deaths for three example causes.
top_causes = df[['Meningitis', 'Nutritional Deficiencies', 'Alzheimer\'s Disease and Other Dementias']].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
top_causes.plot(kind='bar', color='teal')
plt.xlabel('Cause of Death')
plt.ylabel('Total Deaths')
plt.title('Total Deaths for Three Selected Causes')
plt.show()
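# Interactive Visualization
import plotly.express as px
# Total deaths per cause (identifier columns and the overall total are excluded
# so that 'Total_no_of_Deaths' is not ranked as a cause)
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year', 'Total_no_of_Deaths']).sum().sort_values(ascending=False)
# Top causes of death
top_causes = total_deaths.head(10).reset_index()
top_causes.columns = ['Cause', 'Total Deaths']
fig = px.bar(top_causes, x='Total Deaths', y='Cause', title='Top 10 Causes of Death Globally', orientation='h')
fig.show()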
diseases = ['Meningitis',
'Alzheimer\'s Disease and Other Dementias',
'Parkinson\'s Disease',
'Nutritional Deficiencies',
'Malaria',
'Drowning',
'Interpersonal Violence',
'Maternal Disorders',
'HIV/AIDS',
'Drug Use Disorders',
'Tuberculosis',
'Cardiovascular Diseases',
'Lower Respiratory Infections',
'Neonatal Disorders',
'Alcohol Use Disorders',
'Self-harm',
'Exposure to Forces of Nature',
'Diarrheal Diseases',
'Environmental Heat and Cold Exposure',
'Neoplasms',
'Conflict and Terrorism',
'Diabetes Mellitus',
'Chronic Kidney Disease',
'Poisonings',
'Protein-Energy Malnutrition',
'Road Injuries',
'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases',
'Digestive Diseases',
'Fire, Heat, and Hot Substances',
'Acute Hepatitis']
for x in diseases:
    # Skip anything that is not a numeric column
    if pd.api.types.is_numeric_dtype(df[x]):
        # Top 10 countries by total deaths from this cause, summed over all years
        data = df.groupby(['Country/Territory'])[x].sum().sort_values(ascending=False)[:10]
        plt.figure(figsize=(12,6))
        plt.bar(x=data.index, height=data.values, width=0.9,
                color=['crimson', 'blue', 'green', 'yellow', 'magenta'])
        plt.xticks(rotation='vertical')
        plt.xlabel("COUNTRIES", size=10)
        # The values are summed death counts, not rates per million
        plt.ylabel(x.upper() + ' DEATHS')
        plt.title("COUNTRIES WITH HIGHEST " + x.upper() + ' DEATHS')
        plt.show()
Trends Over Time:
# Total number of deaths grouped by year
Deaths_by_year = df.groupby('Year')['Total_no_of_Deaths'].sum().reset_index()
Deaths_by_year
| | Year | Total_no_of_Deaths |
|---|---|---|
| 0 | 1990 | 43518516 |
| 1 | 1991 | 44059729 |
| 2 | 1992 | 44459130 |
| 3 | 1993 | 45185713 |
| 4 | 1994 | 46182613 |
| 5 | 1995 | 46177018 |
| 6 | 1996 | 46320827 |
| 7 | 1997 | 46672370 |
| 8 | 1998 | 47066088 |
| 9 | 1999 | 47652090 |
| 10 | 2000 | 48050317 |
| 11 | 2001 | 48385692 |
| 12 | 2002 | 48897031 |
| 13 | 2003 | 49123952 |
| 14 | 2004 | 49330171 |
| 15 | 2005 | 49591909 |
| 16 | 2006 | 49424521 |
| 17 | 2007 | 49495216 |
| 18 | 2008 | 50115740 |
| 19 | 2009 | 49900666 |
| 20 | 2010 | 50422775 |
| 21 | 2011 | 50413303 |
| 22 | 2012 | 50597654 |
| 23 | 2013 | 50931550 |
| 24 | 2014 | 51268375 |
| 25 | 2015 | 51856393 |
| 26 | 2016 | 52337435 |
| 27 | 2017 | 52789758 |
| 28 | 2018 | 53545244 |
| 29 | 2019 | 54362920 |
# Create line chart
plt.figure(figsize=(12,8))
sns.lineplot(data = Deaths_by_year, x = 'Year', y = 'Total_no_of_Deaths')
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time series of total number of deaths around the world', fontsize=15)
plt.show()
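The overall growth can be quantified directly from the Deaths_by_year table. The short worked example below uses the 1990 and 2019 totals copied from that table; the variable names are chosen here for illustration only.

```python
# Totals copied from the Deaths_by_year table
deaths_1990 = 43_518_516
deaths_2019 = 54_362_920
years = 2019 - 1990  # 29 year-to-year steps

# Overall relative increase between the first and last year
total_change = (deaths_2019 - deaths_1990) / deaths_1990

# Compound annual growth rate implied by that increase
annual_growth = (deaths_2019 / deaths_1990) ** (1 / years) - 1

print(f"Increase 1990-2019: {total_change:.1%}")
print(f"Implied average annual growth: {annual_growth:.2%}")
```

This works out to an increase of roughly 25% over the period, or a little under 1% per year. These are raw counts, so the figure does not account for population growth.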
# Line chart comparing the total number of deaths across the top 10 countries
plt.figure(figsize=(12,8))
for i in Top10_countries['Country/Territory']:
    a = df[df['Country/Territory']==i]
    sns.lineplot(data=a, x='Year', y='Total_no_of_Deaths', label=i)
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Total number of deaths over time in the top 10 countries', fontsize=15)
plt.show()
import matplotlib.pyplot as plt
# Group by year and calculate the mean deaths for a specific cause, e.g., 'Cardiovascular Diseases'
trend_cardio = df.groupby('Year')['Cardiovascular Diseases'].mean()
# Plot the trend over time
plt.figure(figsize=(12, 6))
plt.plot(trend_cardio.index, trend_cardio.values, marker='o')
plt.xlabel('Year')
plt.ylabel('Average Deaths')
plt.title('Trend of Cardiovascular Diseases Over Time')
plt.grid(True)
plt.show()
Visualizing Top Causes of Death
import seaborn as sns
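# Calculate total deaths for each cause (the overall total is excluded
# so that 'Total_no_of_Deaths' is not ranked as a cause)
total_deaths = df.drop(columns=['Country/Territory', 'Code', 'Year', 'Total_no_of_Deaths']).sum().sort_values(ascending=False)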
# Create a DataFrame for top causes of death
top_causes = total_deaths.head(10).reset_index()
top_causes.columns = ['Cause', 'Total Deaths']
# Plot
plt.figure(figsize=(12, 8))
sns.barplot(x='Total Deaths', y='Cause', data=top_causes, palette='viridis')
plt.xlabel('Total Deaths')
plt.title('Top 10 Causes of Death Globally')
plt.show()
Analyze Outlier Distribution:
plt.figure(figsize=(15,10))
sns.boxplot(data=df.drop(columns=['Country/Territory', 'Code', 'Year']))
plt.xticks(rotation=90)
plt.show()
# Visualize distributions of key features
df.hist(bins=30, figsize=(20,15))
plt.show()
for column in df.select_dtypes(include=[np.number]).columns:
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()
# List of African countries (this can be expanded or adjusted as needed)
african_countries = [
'Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi',
'Cabo Verde', 'Cameroon', 'Central African Republic', 'Chad', 'Comoros',
'Congo', 'Cote d\'Ivoire', 'Djibouti', 'DR Congo', 'Egypt', 'Equatorial Guinea',
'Eritrea', 'Eswatini', 'Ethiopia', 'Gabon', 'Gambia', 'Ghana', 'Guinea',
'Guinea-Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Libya', 'Madagascar', 'Malawi',
'Mali', 'Mauritania', 'Mauritius', 'Morocco', 'Mozambique', 'Namibia',
'Niger', 'Nigeria', 'Rwanda', 'Sao Tome and Principe', 'Senegal', 'Seychelles',
'Sierra Leone', 'Somalia', 'South Africa', 'South Sudan', 'Sudan', 'Tanzania',
'Togo', 'Tunisia', 'Uganda', 'Zambia', 'Zimbabwe'
]
# Filter out African countries
non_african_malaria = df[~df['Country/Territory'].isin(african_countries)]
malaria_non_african = non_african_malaria.groupby("Country/Territory")["Malaria"].sum().sort_values(ascending=False).head(10)
malaria_non_african
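Country/Territory
Democratic Republic of Congo    2557219
India                           2439244
Bangladesh                       349375
Pakistan                         213590
Myanmar                          157143
Yemen                            143463
Indonesia                         74664
Brazil                            39970
Papua New Guinea                  35997
Haiti                             28833
Name: Malaria, dtype: int64
Note: the Democratic Republic of Congo tops this list only because african_countries spells it 'DR Congo', which does not match the name used in the dataset, so it is not filtered out. Other entries in the list may have the same problem.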
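The filter only works if every name in the country list matches the dataset's spelling exactly. A minimal sketch of a check for that is shown below; the helper `unmatched_names` and the toy values are hypothetical and only illustrate the idea.

```python
def unmatched_names(candidates, known):
    """Return the candidate names that do not occur among the known names."""
    known_set = set(known)
    return sorted(name for name in candidates if name not in known_set)

# Toy example (hypothetical values); in the notebook, `known` would be
# df['Country/Territory'].unique() and `candidates` would be african_countries
dataset_names = ["Democratic Republic of Congo", "Kenya", "India"]
region_list = ["DR Congo", "Kenya"]

missing = unmatched_names(region_list, dataset_names)
print(missing)
```

Any name this returns is silently ignored by `isin`, so it is worth running before interpreting the filtered result.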
Create New Features:
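Yearly change: the year-over-year difference in deaths from a cause captures the short-term trend within each country.
# Example: change in meningitis deaths from the previous year (0 for each country's first year; assumes rows are ordered by year within each country)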
df['meningitis_change'] = df.groupby('Country/Territory')['Meningitis'].diff().fillna(0)
Cumulative total: a running sum of deaths per country captures the long-term trend.
# Example: cumulative sum of total deaths within each country
df['Cumulative_Deaths'] = df.groupby('Country/Territory')['Total_no_of_Deaths'].cumsum()
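Two other trend features in the same spirit are a relative year-over-year change and a rolling mean. The sketch below builds both on a small toy frame with the same layout as df; the values and the new column names (`deaths_pct_change`, `deaths_rolling_3y`) are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the same layout as df (hypothetical values)
toy = pd.DataFrame({
    "Country/Territory": ["A"] * 4 + ["B"] * 4,
    "Year": [1990, 1991, 1992, 1993] * 2,
    "Total_no_of_Deaths": [100, 110, 121, 133, 50, 45, 54, 54],
})
# Both features depend on rows being ordered by year within each country
toy = toy.sort_values(["Country/Territory", "Year"])

# Relative year-over-year change within each country (NaN for the first year)
toy["deaths_pct_change"] = (
    toy.groupby("Country/Territory")["Total_no_of_Deaths"].pct_change()
)

# 3-year rolling mean within each country, smoothing single-year spikes
toy["deaths_rolling_3y"] = (
    toy.groupby("Country/Territory")["Total_no_of_Deaths"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
print(toy)
```

Grouping by country before each operation keeps one country's last year from leaking into the next country's first year.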
# Check data types
df.dtypes
Country/Territory                              object
Code                                           object
Year                                           int64
Meningitis                                     int64
Alzheimer's Disease and Other Dementias        int64
Parkinson's Disease                            int64
Nutritional Deficiencies                       int64
Malaria                                        int64
Drowning                                       int64
Interpersonal Violence                         int64
Maternal Disorders                             int64
HIV/AIDS                                       int64
Drug Use Disorders                             int64
Tuberculosis                                   int64
Cardiovascular Diseases                        int64
Lower Respiratory Infections                   int64
Neonatal Disorders                             int64
Alcohol Use Disorders                          int64
Self-harm                                      int64
Exposure to Forces of Nature                   int64
Diarrheal Diseases                             int64
Environmental Heat and Cold Exposure           int64
Neoplasms                                      int64
Conflict and Terrorism                         int64
Diabetes Mellitus                              int64
Chronic Kidney Disease                         int64
Poisonings                                     int64
Protein-Energy Malnutrition                    int64
Road Injuries                                  int64
Chronic Respiratory Diseases                   int64
Cirrhosis and Other Chronic Liver Diseases     int64
Digestive Diseases                             int64
Fire, Heat, and Hot Substances                 int64
Acute Hepatitis                                int64
Total_no_of_Deaths                             int64
meningitis_change                              float64
Cumulative_Deaths                              int64
dtype: object
df['Year'] = df['Year'].astype(int)
df.dtypes
Country/Territory                              object
Code                                           object
Year                                           int32
Meningitis                                     int64
Alzheimer's Disease and Other Dementias        int64
Parkinson's Disease                            int64
Nutritional Deficiencies                       int64
Malaria                                        int64
Drowning                                       int64
Interpersonal Violence                         int64
Maternal Disorders                             int64
HIV/AIDS                                       int64
Drug Use Disorders                             int64
Tuberculosis                                   int64
Cardiovascular Diseases                        int64
Lower Respiratory Infections                   int64
Neonatal Disorders                             int64
Alcohol Use Disorders                          int64
Self-harm                                      int64
Exposure to Forces of Nature                   int64
Diarrheal Diseases                             int64
Environmental Heat and Cold Exposure           int64
Neoplasms                                      int64
Conflict and Terrorism                         int64
Diabetes Mellitus                              int64
Chronic Kidney Disease                         int64
Poisonings                                     int64
Protein-Energy Malnutrition                    int64
Road Injuries                                  int64
Chronic Respiratory Diseases                   int64
Cirrhosis and Other Chronic Liver Diseases     int64
Digestive Diseases                             int64
Fire, Heat, and Hot Substances                 int64
Acute Hepatitis                                int64
Total_no_of_Deaths                             int64
meningitis_change                              float64
Cumulative_Deaths                              int64
dtype: object
df.drop(['Country/Territory', 'Code'], axis=1, inplace=True)
Renaming columns to a cleaner format in the original DataFrame
# Lowercase the names, drop apostrophes, and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace("'", "").str.replace(" ", "_")
df.head()
| | year | meningitis | alzheimers_disease_and_other_dementias | parkinsons_disease | nutritional_deficiencies | malaria | drowning | interpersonal_violence | maternal_disorders | hiv/aids | ... | protein-energy_malnutrition | road_injuries | chronic_respiratory_diseases | cirrhosis_and_other_chronic_liver_diseases | digestive_diseases | fire,_heat,_and_hot_substances | acute_hepatitis | total_no_of_deaths | meningitis_change | cumulative_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990 | 2159 | 1116 | 371 | 2087 | 93 | 1370 | 1538 | 2655 | 34 | ... | 2054 | 4154 | 5945 | 2673 | 5005 | 323 | 2985 | 147971 | 0.00 | 147971 |
| 1 | 1991 | 2218 | 1136 | 374 | 2153 | 189 | 1391 | 2001 | 2885 | 41 | ... | 2119 | 4472 | 6050 | 2728 | 5120 | 332 | 3092 | 156844 | 59.00 | 304815 |
| 2 | 1992 | 2475 | 1162 | 378 | 2441 | 239 | 1514 | 2299 | 3315 | 48 | ... | 2404 | 5106 | 6223 | 2830 | 5335 | 360 | 3325 | 169156 | 257.00 | 473971 |
| 3 | 1993 | 2812 | 1187 | 384 | 2837 | 108 | 1687 | 2589 | 3671 | 56 | ... | 2797 | 5681 | 6445 | 2943 | 5568 | 396 | 3601 | 182230 | 337.00 | 656201 |
| 4 | 1994 | 3027 | 1211 | 391 | 3081 | 211 | 1809 | 2849 | 3863 | 63 | ... | 3038 | 6001 | 6664 | 3027 | 5739 | 420 | 3816 | 194795 | 215.00 | 850996 |
5 rows × 35 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 35 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   year                                        6120 non-null   int32
 1   meningitis                                  6120 non-null   int64
 2   alzheimers_disease_and_other_dementias      6120 non-null   int64
 3   parkinsons_disease                          6120 non-null   int64
 4   nutritional_deficiencies                    6120 non-null   int64
 5   malaria                                     6120 non-null   int64
 6   drowning                                    6120 non-null   int64
 7   interpersonal_violence                      6120 non-null   int64
 8   maternal_disorders                          6120 non-null   int64
 9   hiv/aids                                    6120 non-null   int64
 10  drug_use_disorders                          6120 non-null   int64
 11  tuberculosis                                6120 non-null   int64
 12  cardiovascular_diseases                     6120 non-null   int64
 13  lower_respiratory_infections                6120 non-null   int64
 14  neonatal_disorders                          6120 non-null   int64
 15  alcohol_use_disorders                       6120 non-null   int64
 16  self-harm                                   6120 non-null   int64
 17  exposure_to_forces_of_nature                6120 non-null   int64
 18  diarrheal_diseases                          6120 non-null   int64
 19  environmental_heat_and_cold_exposure        6120 non-null   int64
 20  neoplasms                                   6120 non-null   int64
 21  conflict_and_terrorism                      6120 non-null   int64
 22  diabetes_mellitus                           6120 non-null   int64
 23  chronic_kidney_disease                      6120 non-null   int64
 24  poisonings                                  6120 non-null   int64
 25  protein-energy_malnutrition                 6120 non-null   int64
 26  road_injuries                               6120 non-null   int64
 27  chronic_respiratory_diseases                6120 non-null   int64
 28  cirrhosis_and_other_chronic_liver_diseases  6120 non-null   int64
 29  digestive_diseases                          6120 non-null   int64
 30  fire,_heat,_and_hot_substances              6120 non-null   int64
 31  acute_hepatitis                             6120 non-null   int64
 32  total_no_of_deaths                          6120 non-null   int64
 33  meningitis_change                           6120 non-null   float64
 34  cumulative_deaths                           6120 non-null   int64
dtypes: float64(1), int32(1), int64(33)
memory usage: 1.6 MB
# Convert 'meningitis_change' from float64 to int64
df['meningitis_change'] = df['meningitis_change'].astype(int)
# Verify the change
df.info()  # info() prints its report itself and returns None, so wrapping it in print() only adds a stray "None"
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6120 entries, 0 to 6119
Data columns (total 35 columns):
 #   Column                                      Non-Null Count  Dtype
---  ------                                      --------------  -----
 0   year                                        6120 non-null   int32
 1   meningitis                                  6120 non-null   int64
 2   alzheimers_disease_and_other_dementias      6120 non-null   int64
 3   parkinsons_disease                          6120 non-null   int64
 4   nutritional_deficiencies                    6120 non-null   int64
 5   malaria                                     6120 non-null   int64
 6   drowning                                    6120 non-null   int64
 7   interpersonal_violence                      6120 non-null   int64
 8   maternal_disorders                          6120 non-null   int64
 9   hiv/aids                                    6120 non-null   int64
 10  drug_use_disorders                          6120 non-null   int64
 11  tuberculosis                                6120 non-null   int64
 12  cardiovascular_diseases                     6120 non-null   int64
 13  lower_respiratory_infections                6120 non-null   int64
 14  neonatal_disorders                          6120 non-null   int64
 15  alcohol_use_disorders                       6120 non-null   int64
 16  self-harm                                   6120 non-null   int64
 17  exposure_to_forces_of_nature                6120 non-null   int64
 18  diarrheal_diseases                          6120 non-null   int64
 19  environmental_heat_and_cold_exposure        6120 non-null   int64
 20  neoplasms                                   6120 non-null   int64
 21  conflict_and_terrorism                      6120 non-null   int64
 22  diabetes_mellitus                           6120 non-null   int64
 23  chronic_kidney_disease                      6120 non-null   int64
 24  poisonings                                  6120 non-null   int64
 25  protein-energy_malnutrition                 6120 non-null   int64
 26  road_injuries                               6120 non-null   int64
 27  chronic_respiratory_diseases                6120 non-null   int64
 28  cirrhosis_and_other_chronic_liver_diseases  6120 non-null   int64
 29  digestive_diseases                          6120 non-null   int64
 30  fire,_heat,_and_hot_substances              6120 non-null   int64
 31  acute_hepatitis                             6120 non-null   int64
 32  total_no_of_deaths                          6120 non-null   int64
 33  meningitis_change                           6120 non-null   int32
 34  cumulative_deaths                           6120 non-null   int64
dtypes: int32(2), int64(33)
memory usage: 1.6 MB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
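# Define features and target
# Note: X still contains every cause-of-death column plus cumulative_deaths,
# which is computed from the target, so the models can reconstruct the target
# almost directly from the features.
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']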
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Initialize and train the model
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train, y_train)
# Make predictions
y_pred_lin_reg = lin_reg_model.predict(X_test)
# Evaluate the model
mse_lin_reg = mean_squared_error(y_test, y_pred_lin_reg)
r2_lin_reg = r2_score(y_test, y_pred_lin_reg)
print(f"Linear Regression Mean Squared Error: {mse_lin_reg}")
print(f"Linear Regression R-squared: {r2_lin_reg}")
Linear Regression Mean Squared Error: 2.0415135847184325e-18
Linear Regression R-squared: 1.0
from sklearn.linear_model import Ridge
# Initialize Ridge Regression
ridge_model = Ridge(alpha=1.0) # Adjust alpha to control regularization strength
ridge_model.fit(X_train, y_train)
# Make predictions
y_pred_ridge = ridge_model.predict(X_test)
# Evaluate the model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
print(f"Ridge Regression Mean Squared Error: {mse_ridge}")
print(f"Ridge Regression R-squared: {r2_ridge}")
Ridge Regression Mean Squared Error: 637395.7524604569
Ridge Regression R-squared: 0.9999989133618953
alphas = [0.1, 1.0, 10.0, 100.0]
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    y_pred_ridge = ridge_model.predict(X_test)
    mse_ridge = mean_squared_error(y_test, y_pred_ridge)
    r2_ridge = r2_score(y_test, y_pred_ridge)
    print(f"Alpha: {alpha}")
    print(f"Ridge Regression Mean Squared Error: {mse_ridge}")
    print(f"Ridge Regression R-squared: {r2_ridge}")
Alpha: 0.1
Ridge Regression Mean Squared Error: 7541.972988469264
Ridge Regression R-squared: 0.9999999871423755
Alpha: 1.0
Ridge Regression Mean Squared Error: 637395.7524604569
Ridge Regression R-squared: 0.9999989133618953
Alpha: 10.0
Ridge Regression Mean Squared Error: 26572135.303873844
Ridge Regression R-squared: 0.9999546995808584
Alpha: 100.0
Ridge Regression Mean Squared Error: 308119194.9415028
Ridge Regression R-squared: 0.999474715580181
from sklearn.linear_model import Lasso
# Initialize Lasso Regression
lasso_model = Lasso(alpha=1.0) # Adjust alpha as needed
lasso_model.fit(X_train, y_train)
# Make predictions
y_pred_lasso = lasso_model.predict(X_test)
# Evaluate the model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print(f"Lasso Regression Mean Squared Error: {mse_lasso}")
print(f"Lasso Regression R-squared: {r2_lasso}")
Lasso Regression Mean Squared Error: 271204822.12131757
Lasso Regression R-squared: 0.999537647540371
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.692e+11, tolerance: 3.953e+11
from sklearn.ensemble import RandomForestRegressor
# Initialize Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred_rf = rf_model.predict(X_test)
# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
from sklearn.model_selection import cross_val_score
# Cross-validation with Ridge Regression
cv_scores_ridge = cross_val_score(ridge_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Ridge): {-cv_scores_ridge.mean()}")
# Cross-validation with Lasso Regression
cv_scores_lasso = cross_val_score(lasso_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Lasso): {-cv_scores_lasso.mean()}")
Cross-Validated Mean Squared Error (Ridge): 3.5722394712611494e-07
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.501e+11, tolerance: 2.362e+11
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.994e+11, tolerance: 3.809e+11
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.879e+11, tolerance: 4.456e+11
Cross-Validated Mean Squared Error (Lasso): 3183700565.571262
D:\SAMAANACONDA\Lib\site-packages\sklearn\linear_model\_coordinate_descent.py:697: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.747e+11, tolerance: 4.485e+11
# Hyperparameter tuning
# ridge reg
from sklearn.model_selection import GridSearchCV
# Define parameter grid
parameters_ridge = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
ridge_model = Ridge()
# Perform grid search
grid_search_ridge = GridSearchCV(ridge_model, parameters_ridge, scoring='neg_mean_squared_error', cv=5)
grid_search_ridge.fit(X_train, y_train)
# Best parameters and score
print(f"Best Parameters (Ridge): {grid_search_ridge.best_params_}")
print(f"Best Score (Ridge): {-grid_search_ridge.best_score_}")
Best Parameters (Ridge): {'alpha': 0.01}
Best Score (Ridge): 149.56717271873725
# lasso reg
# Define parameter grid for Lasso
parameters_lasso = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
lasso_model = Lasso(max_iter=10000)
# Perform grid search
grid_search_lasso = GridSearchCV(lasso_model, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
grid_search_lasso.fit(X_train, y_train)
# Best parameters and score
print(f"Best Parameters (Lasso): {grid_search_lasso.best_params_}")
print(f"Best Score (Lasso): {-grid_search_lasso.best_score_}")
Best Parameters (Lasso): {'alpha': 0.01}
Best Score (Lasso): 2120883.2258921517
# Define features and target
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
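# Define categorical and numerical features
# Note: 'year' and 'meningitis_change' are int32 at this point, so the
# ['int64', 'float64'] filter below leaves them out and ColumnTransformer
# drops them (hence 32 preprocessed columns instead of 34).
# Using include='number' would keep them.
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns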
# Create preprocessing pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_features),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)
# Apply preprocessing to both training and testing data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)
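# Initialize models
# GradientBoostingRegressor has not been imported before this cell, so import it here
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
ridge_model = Ridge()
lasso_model = Lasso(max_iter=10000)
rf_model = RandomForestRegressor()
gb_model = GradientBoostingRegressor()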
# Define parameter grids for GridSearchCV
parameters_ridge = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
parameters_lasso = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
# Perform grid search for Ridge and Lasso
grid_search_ridge = GridSearchCV(ridge_model, parameters_ridge, scoring='neg_mean_squared_error', cv=5)
grid_search_ridge.fit(X_train_preprocessed, y_train)
best_ridge_model = grid_search_ridge.best_estimator_
grid_search_lasso = GridSearchCV(lasso_model, parameters_lasso, scoring='neg_mean_squared_error', cv=5)
grid_search_lasso.fit(X_train_preprocessed, y_train)
best_lasso_model = grid_search_lasso.best_estimator_
# Train Random Forest and Gradient Boosting models
rf_model.fit(X_train_preprocessed, y_train)
gb_model.fit(X_train_preprocessed, y_train)
# Make predictions
y_pred_ridge = best_ridge_model.predict(X_test_preprocessed)
y_pred_lasso = best_lasso_model.predict(X_test_preprocessed)
y_pred_rf = rf_model.predict(X_test_preprocessed)
y_pred_gb = gb_model.predict(X_test_preprocessed)
# Evaluate models
def evaluate_model(y_true, y_pred, model_name):
    """Print the MSE and R-squared of a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{model_name} Mean Squared Error: {mse}")
    print(f"{model_name} R-squared: {r2}")
evaluate_model(y_test, y_pred_ridge, "Ridge Regression")
evaluate_model(y_test, y_pred_lasso, "Lasso Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest")
evaluate_model(y_test, y_pred_gb, "Gradient Boosting")
# Cross-validation scores
cv_scores_ridge = cross_val_score(best_ridge_model, preprocessor.transform(X), y, cv=5, scoring='neg_mean_squared_error')
cv_scores_lasso = cross_val_score(best_lasso_model, preprocessor.transform(X), y, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Ridge): {-cv_scores_ridge.mean()}")
print(f"Cross-Validated Mean Squared Error (Lasso): {-cv_scores_lasso.mean()}")
Ridge Regression Mean Squared Error: 75.21693165381222
Ridge Regression R-squared: 0.9999999998717695
Lasso Regression Mean Squared Error: 2109596.3646646277
Lasso Regression R-squared: 0.9999964035408353
Random Forest Mean Squared Error: 412402826.7590321
Random Forest R-squared: 0.9992969318914813
Gradient Boosting Mean Squared Error: 607484604.8760853
Gradient Boosting R-squared: 0.9989643546930536
Cross-Validated Mean Squared Error (Ridge): 2917.6510757034894
Cross-Validated Mean Squared Error (Lasso): 24126655.184786927
# advanced techniques
from sklearn.ensemble import GradientBoostingRegressor
# Initialize and train Gradient Boosting Regressor
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train_preprocessed, y_train)
# Make predictions
y_pred_gb = gb_model.predict(X_test_preprocessed)
# Evaluate the Gradient Boosting model
evaluate_model(y_test, y_pred_gb, "Gradient Boosting")
Gradient Boosting Mean Squared Error: 646083812.675353
Gradient Boosting R-squared: 0.9988985504107915
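Ridge and Lasso both reach R² values extremely close to 1 on the test set. This is best read as a property of the data rather than of the models: total_no_of_deaths appears to be the sum of the individual cause columns (plain linear regression fits it with an MSE of about 2e-18), so a linear model can reconstruct the target from the features almost exactly. Lasso's much higher MSE (about 2.1 million versus about 75 for Ridge) likely reflects coefficient shrinkage and slow convergence rather than a poorer fit to the underlying pattern.
Random Forest and Gradient Boosting have slightly lower R² values (about 0.999) and far higher MSE values (roughly 4e8 to 6e8). Tree ensembles produce piecewise-constant predictions and cannot represent an exact sum, so this gap is expected here and does not by itself indicate overfitting or poor tuning.
The cross-validated MSE values for Ridge (about 2,918) and Lasso (about 24.1 million) are considerably higher than the corresponding test-set values (about 75 and 2.1 million), which suggests that the single train/test split gives an optimistic estimate. cross_val_score uses unshuffled folds by default, so each fold holds out a contiguous block of rows, which may contribute to the gap.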
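Near-perfect scores like those reported above are what one would expect when the target is an exact sum of feature columns. The sketch below shows how that could be checked, using a small toy frame in place of the notebook's df; the three column names and the random values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the notebook's df (hypothetical values)
rng = np.random.default_rng(42)
cause_cols = ["malaria", "neoplasms", "road_injuries"]
toy = pd.DataFrame(rng.integers(0, 1000, size=(200, 3)), columns=cause_cols)
toy["total_no_of_deaths"] = toy[cause_cols].sum(axis=1)

# 1) Direct check: is the target just the row-wise sum of the features?
is_exact_sum = (toy[cause_cols].sum(axis=1) == toy["total_no_of_deaths"]).all()
print("Target equals sum of causes:", is_exact_sum)

# 2) Consequence: a linear model recovers coefficients of ~1 and R^2 of ~1
model = LinearRegression().fit(toy[cause_cols], toy["total_no_of_deaths"])
r2 = model.score(toy[cause_cols], toy["total_no_of_deaths"])
print("Coefficients:", np.round(model.coef_, 6))
print("R^2:", r2)
```

If the same check holds on the real data, the regression task is trivial by construction, and a more informative setup would predict the total from inputs that are not components of it.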
# hyperparameter tuning using GridSearchCV:
from sklearn.model_selection import GridSearchCV
# Define parameter grids for Random Forest and Gradient Boosting
parameters_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}
parameters_gb = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9, 1.0]
}
# Initialize models
rf_model = RandomForestRegressor()
gb_model = GradientBoostingRegressor()
# Grid Search for Random Forest
grid_search_rf = GridSearchCV(rf_model, parameters_rf, cv=5, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train_preprocessed, y_train)
best_rf_model = grid_search_rf.best_estimator_
# Grid Search for Gradient Boosting
grid_search_gb = GridSearchCV(gb_model, parameters_gb, cv=5, scoring='neg_mean_squared_error')
grid_search_gb.fit(X_train_preprocessed, y_train)
best_gb_model = grid_search_gb.best_estimator_
# Make predictions with tuned models
y_pred_rf = best_rf_model.predict(X_test_preprocessed)
y_pred_gb = best_gb_model.predict(X_test_preprocessed)
# Evaluate models
evaluate_model(y_test, y_pred_rf, "Tuned Random Forest")
evaluate_model(y_test, y_pred_gb, "Tuned Gradient Boosting")
Tuned Random Forest Mean Squared Error: 423071262.43593276
Tuned Random Forest R-squared: 0.9992787442448273
Tuned Gradient Boosting Mean Squared Error: 351335976.6355588
Tuned Gradient Boosting R-squared: 0.9994010392157373
# Feature Engineering
# Feature Selection: Remove irrelevant or redundant features to improve model performance. You can use techniques like Recursive Feature Elimination (RFE) or feature importance scores.
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
# Initialize the model and RFE
model = Ridge()
rfe = RFE(model, n_features_to_select=10)
# Fit RFE
X_train_rfe = rfe.fit_transform(X_train_preprocessed, y_train)
X_test_rfe = rfe.transform(X_test_preprocessed)
# Use the selected features to fit the model
model.fit(X_train_rfe, y_train)
y_pred = model.predict(X_test_rfe)
evaluate_model(y_test, y_pred, "Ridge with RFE")
Ridge with RFE Mean Squared Error: 1016822564.5699499
Ridge with RFE R-squared: 0.9982665115979211
# Model Validation
# Cross-Validation: Perform cross-validation to ensure that the model is not overfitting and performs well on unseen data.
from sklearn.model_selection import cross_val_score
# Cross-validation scores
cv_scores_rf = cross_val_score(best_rf_model, X_train_preprocessed, y_train, cv=5, scoring='neg_mean_squared_error')
cv_scores_gb = cross_val_score(best_gb_model, X_train_preprocessed, y_train, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-Validated Mean Squared Error (Random Forest): {-cv_scores_rf.mean()}")
print(f"Cross-Validated Mean Squared Error (Gradient Boosting): {-cv_scores_gb.mean()}")
Cross-Validated Mean Squared Error (Random Forest): 491233682.1143919
Cross-Validated Mean Squared Error (Gradient Boosting): 320611482.3465633
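One way to make cross-validation scores more trustworthy is to put the preprocessing inside a pipeline and to shuffle the folds. The sketch below illustrates this on synthetic data; `X_demo`, `y_demo`, and the choice of Ridge with alpha=1.0 are assumptions made to keep the example self-contained, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical, only to keep the sketch runnable)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=300)

# Scaling lives inside the pipeline, so each fold fits the scaler on its own
# training part only and the held-out part never influences preprocessing
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Shuffled folds avoid evaluating on blocks of consecutive rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv,
                         scoring="neg_mean_squared_error")
print(f"Cross-validated MSE: {-scores.mean():.4f} (std {scores.std():.4f})")
```

For data with repeated measurements per country, grouping the folds by country (for example with scikit-learn's GroupKFold) would be a stricter alternative, since it keeps all rows of a country in the same fold.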
# Data Quality
# Review Data Preprocessing: Check for any issues in data preprocessing and ensure all necessary transformations are applied correctly.
# Review data preprocessing steps
print(X_train_preprocessed.shape)
print(X_test_preprocessed.shape)
(4896, 32)
(1224, 32)
from sklearn.ensemble import RandomForestRegressor
# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions
rf_y_pred = rf_model.predict(X_test)
# Evaluate the model
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {rf_mse}")
print(f"Random Forest R-squared: {rf_r2}")
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
from sklearn.model_selection import cross_val_score
# Linear Regression with cross-validation
lr_model = LinearRegression()
lr_scores = cross_val_score(lr_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Linear Regression Mean Cross-Validated MSE: {-lr_scores.mean()}")
# Random Forest with cross-validation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Random Forest Mean Cross-Validated MSE: {-rf_scores.mean()}")
Linear Regression Mean Cross-Validated MSE: 6.117124717691846e-14
Random Forest Mean Cross-Validated MSE: 71124482809.29434
import matplotlib.pyplot as plt
import seaborn as sns
# Train Linear Regression Model
lr_model.fit(X_train, y_train)
y_train_pred = lr_model.predict(X_train)
residuals = y_train - y_train_pred
# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_train_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()
import pandas as pd
# Train Random Forest Model
rf_model.fit(X_train, y_train)
# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 8))
feature_importances.plot(kind='bar')
plt.title('Feature Importances from Random Forest')
plt.show()
# Linear Regression Evaluation
lr_model.fit(X_train, y_train)
y_test_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_test_pred_lr)
r2_lr = r2_score(y_test, y_test_pred_lr)
print(f"Linear Regression Mean Squared Error: {mse_lr}")
print(f"Linear Regression R-squared: {r2_lr}")
# Random Forest Evaluation
rf_y_pred = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, rf_y_pred)
r2_rf = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Linear Regression Mean Squared Error: 2.0415135847184325e-18
Linear Regression R-squared: 1.0
Random Forest Mean Squared Error: 441563295.6740831
Random Forest R-squared: 0.9992472188575413
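A linear-regression MSE near 1e-18 with an R-squared of exactly 1.0 usually means the target can be rebuilt exactly from the features, for example when a total column is the sum of the per-cause columns. The sketch below shows one way to test for that before trusting the score. It runs on a small synthetic frame; the column names and the helper `target_is_feature_sum` are hypothetical, and whether the same relationship holds in this dataset is an assumption to verify.

```python
import numpy as np
import pandas as pd

def target_is_feature_sum(df, target, feature_cols, rtol=1e-9):
    """Return True if `target` equals the row-wise sum of `feature_cols`."""
    row_sum = df[feature_cols].sum(axis=1)
    return bool(np.allclose(row_sum, df[target], rtol=rtol))

# Synthetic stand-in for the mortality table (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cause_a": rng.integers(0, 1000, size=50),
    "cause_b": rng.integers(0, 1000, size=50),
    "cause_c": rng.integers(0, 1000, size=50),
})
demo["total"] = demo[["cause_a", "cause_b", "cause_c"]].sum(axis=1)

leaky = target_is_feature_sum(demo, "total", ["cause_a", "cause_b", "cause_c"])
print(f"Target is an exact sum of the features: {leaky}")
```

If the check returns True, a perfect linear fit says nothing about predictive skill, and the summed columns are candidates for removal from the feature set.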
Check outliers
def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound

def plot_boxplot(df, column):
    """Plot a boxplot for visualizing outliers."""
    plt.figure(figsize=(10, 6))
    sns.boxplot(df[column])
    plt.title(f'Boxplot for {column}')
    plt.show()

# Check and handle outliers
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        lower_bound, upper_bound = detect_outliers_iqr(df, column)
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Outliers in column '{column}': {len(outliers)}")
        # Plot boxplot to visualize outliers
        plot_boxplot(df, column)
        # Handle outliers by capping values
        df.loc[df[column] < lower_bound, column] = lower_bound
        df.loc[df[column] > upper_bound, column] = upper_bound

# Verify if outliers are handled
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        lower_bound, upper_bound = detect_outliers_iqr(df, column)
        outliers_after = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        print(f"Outliers in column '{column}' after handling: {len(outliers_after)}")
Outliers in column 'meningitis': 1029
Outliers in column 'alzheimers_disease_and_other_dementias': 819
Outliers in column 'parkinsons_disease': 811
Outliers in column 'nutritional_deficiencies': 950
Outliers in column 'malaria': 1278
Outliers in column 'drowning': 733
Outliers in column 'interpersonal_violence': 841
Outliers in column 'maternal_disorders': 789
Outliers in column 'hiv/aids': 1041
Outliers in column 'drug_use_disorders': 725
Outliers in column 'tuberculosis': 916
Outliers in column 'cardiovascular_diseases': 732
Outliers in column 'lower_respiratory_infections': 593
Outliers in column 'neonatal_disorders': 777
Outliers in column 'alcohol_use_disorders': 685
Outliers in column 'self-harm': 722
Outliers in column 'exposure_to_forces_of_nature': 1025
Outliers in column 'diarrheal_diseases': 926
Outliers in column 'environmental_heat_and_cold_exposure': 559
Outliers in column 'neoplasms': 768
Outliers in column 'conflict_and_terrorism': 1188
Outliers in column 'diabetes_mellitus': 872
Outliers in column 'chronic_kidney_disease': 787
Outliers in column 'poisonings': 580
Outliers in column 'protein-energy_malnutrition': 994
Outliers in column 'road_injuries': 765
Outliers in column 'chronic_respiratory_diseases': 918
Outliers in column 'cirrhosis_and_other_chronic_liver_diseases': 796
Outliers in column 'digestive_diseases': 812
Outliers in column 'fire,_heat,_and_hot_substances': 562
Outliers in column 'acute_hepatitis': 802
Outliers in column 'total_no_of_deaths': 712
Outliers in column 'cumulative_deaths': 693
Outliers in column 'meningitis' after handling: 0
Outliers in column 'alzheimers_disease_and_other_dementias' after handling: 0
Outliers in column 'parkinsons_disease' after handling: 0
Outliers in column 'nutritional_deficiencies' after handling: 0
Outliers in column 'malaria' after handling: 0
Outliers in column 'drowning' after handling: 0
Outliers in column 'interpersonal_violence' after handling: 0
Outliers in column 'maternal_disorders' after handling: 0
Outliers in column 'hiv/aids' after handling: 0
Outliers in column 'drug_use_disorders' after handling: 0
Outliers in column 'tuberculosis' after handling: 0
Outliers in column 'cardiovascular_diseases' after handling: 0
Outliers in column 'lower_respiratory_infections' after handling: 0
Outliers in column 'neonatal_disorders' after handling: 0
Outliers in column 'alcohol_use_disorders' after handling: 0
Outliers in column 'self-harm' after handling: 0
Outliers in column 'exposure_to_forces_of_nature' after handling: 0
Outliers in column 'diarrheal_diseases' after handling: 0
Outliers in column 'environmental_heat_and_cold_exposure' after handling: 0
Outliers in column 'neoplasms' after handling: 0
Outliers in column 'conflict_and_terrorism' after handling: 0
Outliers in column 'diabetes_mellitus' after handling: 0
Outliers in column 'chronic_kidney_disease' after handling: 0
Outliers in column 'poisonings' after handling: 0
Outliers in column 'protein-energy_malnutrition' after handling: 0
Outliers in column 'road_injuries' after handling: 0
Outliers in column 'chronic_respiratory_diseases' after handling: 0
Outliers in column 'cirrhosis_and_other_chronic_liver_diseases' after handling: 0
Outliers in column 'digestive_diseases' after handling: 0
Outliers in column 'fire,_heat,_and_hot_substances' after handling: 0
Outliers in column 'acute_hepatitis' after handling: 0
Outliers in column 'total_no_of_deaths' after handling: 0
Outliers in column 'cumulative_deaths' after handling: 0
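The loop above caps one column at a time with two `df.loc` assignments. The same capping can be written as a single `clip` call over all numeric columns, which also avoids modifying the original frame in place. This is a minimal sketch on synthetic data; `cap_iqr` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np
import pandas as pd

def cap_iqr(df, k=1.5):
    """Cap every numeric column at [Q1 - k*IQR, Q3 + k*IQR]; returns a copy."""
    out = df.copy()
    num = out.select_dtypes(include=[np.number]).astype(float)
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Series bounds are aligned to the columns because axis=1.
    capped = num.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    for col in capped.columns:
        out[col] = capped[col]
    return out

# Synthetic example: one extreme value in an otherwise small column.
demo = pd.DataFrame({
    "deaths": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000],
    "country": list("abcdefghij"),
})
capped_demo = cap_iqr(demo)
print(capped_demo["deaths"].max())  # upper fence: 7.75 + 1.5 * 4.5 = 14.5
```

Returning a copy keeps the raw values available for later before/after comparisons.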
def plot_histogram(df, column):
    """Plot a histogram for visualizing the distribution of data."""
    plt.figure(figsize=(10, 6))
    sns.histplot(df[column], kde=True, bins=30)
    plt.title(f'Histogram for {column}')
    plt.show()

# Verify the distribution after handling outliers
for column in df.columns:
    if df[column].dtype in [np.float64, np.int64]:  # Only numeric columns
        plot_histogram(df, column)
import pandas as pd
# Build df_cleaned by dropping rows that fall outside the IQR fences.
# Note: df was already capped in place by the loop above, so the "before"
# statistics printed below describe the capped data, not the raw data.
def handle_outliers_iqr(df):
    df_cleaned = df.copy()
    for column in df.columns:
        if df[column].dtype in [float, int]:  # Apply only to numeric columns
            # Quartiles are recomputed on the rows that survived the previous columns
            Q1 = df_cleaned[column].quantile(0.25)
            Q3 = df_cleaned[column].quantile(0.75)
            IQR = Q3 - Q1
            df_cleaned = df_cleaned[(df_cleaned[column] >= (Q1 - 1.5 * IQR)) & (df_cleaned[column] <= (Q3 + 1.5 * IQR))]
    return df_cleaned

# Define df_cleaned by removing outlier rows
df_cleaned = handle_outliers_iqr(df)

# Descriptive Statistics Function
def descriptive_statistics(df):
    return df.describe().T

# Print descriptive statistics before removing outlier rows
print("Descriptive Statistics Before Handling Outliers")
print(descriptive_statistics(df))
# Print descriptive statistics after removing outlier rows
print("\nDescriptive Statistics After Handling Outliers")
print(descriptive_statistics(df_cleaned))
Descriptive Statistics Before Handling Outliers
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| year | 6120.00 | 2004.50 | 8.66 | 1990.00 | 1997.00 | 2004.50 | 2012.00 | 2019.00 |
| meningitis | 6120.00 | 558.62 | 784.30 | 0.00 | 15.00 | 109.00 | 847.25 | 2095.62 |
| alzheimers_disease_and_other_dementias | 6120.00 | 1677.19 | 2105.45 | 0.00 | 90.00 | 666.50 | 2456.25 | 6005.62 |
| parkinsons_disease | 6120.00 | 412.24 | 513.54 | 0.00 | 27.00 | 164.00 | 609.25 | 1482.62 |
| nutritional_deficiencies | 6120.00 | 739.49 | 1085.38 | 0.00 | 9.00 | 119.00 | 1167.25 | 2904.62 |
| malaria | 6120.00 | 245.05 | 403.45 | 0.00 | 0.00 | 0.00 | 393.00 | 982.50 |
| drowning | 6120.00 | 468.94 | 574.04 | 0.00 | 34.00 | 177.00 | 698.00 | 1694.00 |
| interpersonal_violence | 6120.00 | 612.05 | 740.67 | 0.00 | 40.00 | 265.00 | 877.00 | 2132.50 |
| maternal_disorders | 6120.00 | 453.71 | 664.04 | 0.00 | 5.00 | 54.00 | 734.00 | 1827.50 |
| hiv/aids | 6120.00 | 1198.91 | 1790.61 | 0.00 | 11.00 | 136.00 | 1879.00 | 4681.00 |
| drug_use_disorders | 6120.00 | 81.66 | 110.52 | 0.00 | 3.00 | 20.00 | 129.00 | 318.00 |
| tuberculosis | 6120.00 | 1919.56 | 2691.19 | 0.00 | 35.00 | 417.00 | 2924.25 | 7258.12 |
| cardiovascular_diseases | 6120.00 | 28583.10 | 35091.24 | 4.00 | 2028.00 | 11742.00 | 42546.50 | 103324.25 |
| lower_respiratory_infections | 6120.00 | 6646.39 | 8377.32 | 0.00 | 345.00 | 2126.50 | 10161.25 | 24885.62 |
| neonatal_disorders | 6120.00 | 4667.30 | 6560.24 | 0.00 | 131.00 | 916.00 | 7419.75 | 18352.88 |
| alcohol_use_disorders | 6120.00 | 206.37 | 260.79 | 0.00 | 9.00 | 80.00 | 316.00 | 776.50 |
| self-harm | 6120.00 | 1256.36 | 1543.21 | 0.00 | 94.00 | 533.00 | 1882.25 | 4564.62 |
| exposure_to_forces_of_nature | 6120.00 | 7.44 | 11.56 | 0.00 | 0.00 | 0.00 | 12.00 | 30.00 |
| diarrheal_diseases | 6120.00 | 2518.44 | 3707.42 | 0.00 | 20.00 | 296.50 | 3946.75 | 9836.88 |
| environmental_heat_and_cold_exposure | 6120.00 | 66.75 | 86.90 | 0.00 | 2.00 | 21.00 | 109.00 | 269.50 |
| neoplasms | 6120.00 | 13440.43 | 16817.80 | 1.00 | 809.75 | 5629.50 | 20147.75 | 49154.75 |
| conflict_and_terrorism | 6120.00 | 14.63 | 22.94 | 0.00 | 0.00 | 0.00 | 23.00 | 57.50 |
| diabetes_mellitus | 6120.00 | 2103.89 | 2417.35 | 1.00 | 236.00 | 1087.00 | 2954.00 | 7031.00 |
| chronic_kidney_disease | 6120.00 | 1988.02 | 2421.58 | 0.00 | 145.75 | 822.00 | 2922.50 | 7087.62 |
| poisonings | 6120.00 | 161.24 | 207.59 | 0.00 | 6.00 | 52.50 | 254.00 | 626.00 |
| protein-energy_malnutrition | 6120.00 | 659.48 | 982.64 | 0.00 | 5.00 | 92.00 | 1042.50 | 2598.75 |
| road_injuries | 6120.00 | 2380.77 | 2912.78 | 0.00 | 174.75 | 966.50 | 3435.25 | 8326.00 |
| chronic_respiratory_diseases | 6120.00 | 3656.55 | 4472.84 | 1.00 | 289.00 | 1689.00 | 5249.75 | 12690.88 |
| cirrhosis_and_other_chronic_liver_diseases | 6120.00 | 2470.60 | 2958.07 | 0.00 | 154.00 | 1210.00 | 3547.25 | 8637.12 |
| digestive_diseases | 6120.00 | 4303.08 | 5049.07 | 0.00 | 284.00 | 2185.00 | 6080.00 | 14774.00 |
| fire,_heat,_and_hot_substances | 6120.00 | 284.92 | 350.75 | 0.00 | 17.00 | 126.00 | 450.00 | 1099.50 |
| acute_hepatitis | 6120.00 | 101.45 | 143.79 | 0.00 | 2.00 | 15.00 | 160.00 | 397.00 |
| total_no_of_deaths | 6120.00 | 107820.60 | 128801.21 | 7.00 | 6935.00 | 50257.50 | 158221.00 | 385150.00 |
| meningitis_change | 6120.00 | -21.16 | 919.88 | -10728.00 | -11.00 | -1.00 | 0.00 | 53333.00 |
| cumulative_deaths | 6120.00 | 1483344.83 | 1879915.84 | 13.00 | 71995.25 | 553431.50 | 2266613.50 | 5558540.88 |
Descriptive Statistics After Handling Outliers
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| year | 4070.00 | 2004.07 | 8.65 | 1990.00 | 1997.00 | 2004.00 | 2011.00 | 2019.00 |
| meningitis | 4070.00 | 173.58 | 395.37 | 0.00 | 4.00 | 32.00 | 125.00 | 2095.62 |
| alzheimers_disease_and_other_dementias | 4070.00 | 921.02 | 1412.46 | 0.00 | 28.00 | 275.00 | 1173.00 | 6005.62 |
| parkinsons_disease | 4070.00 | 221.39 | 336.87 | 0.00 | 9.00 | 71.00 | 279.00 | 1482.62 |
| nutritional_deficiencies | 4070.00 | 246.26 | 612.31 | 0.00 | 4.00 | 17.00 | 149.00 | 2904.62 |
| malaria | 4070.00 | 104.94 | 275.86 | 0.00 | 0.00 | 0.00 | 2.00 | 982.50 |
| drowning | 4070.00 | 203.85 | 341.15 | 0.00 | 14.00 | 66.00 | 216.75 | 1694.00 |
| interpersonal_violence | 4070.00 | 313.91 | 517.93 | 0.00 | 16.00 | 99.50 | 355.75 | 2132.50 |
| maternal_disorders | 4070.00 | 151.86 | 361.25 | 0.00 | 2.00 | 11.00 | 76.00 | 1827.50 |
| hiv/aids | 4070.00 | 494.25 | 1149.79 | 0.00 | 4.00 | 37.00 | 235.00 | 4681.00 |
| drug_use_disorders | 4070.00 | 47.15 | 81.68 | 0.00 | 1.00 | 9.00 | 49.00 | 318.00 |
| tuberculosis | 4070.00 | 631.03 | 1409.98 | 0.00 | 12.00 | 73.50 | 500.75 | 7258.12 |
| cardiovascular_diseases | 4070.00 | 14155.90 | 21033.66 | 4.00 | 640.50 | 5170.00 | 19218.25 | 103324.25 |
| lower_respiratory_infections | 4070.00 | 2273.97 | 4004.72 | 0.00 | 86.00 | 678.00 | 2303.25 | 24885.62 |
| neonatal_disorders | 4070.00 | 1509.61 | 3258.89 | 0.00 | 38.00 | 242.00 | 1280.50 | 18352.88 |
| alcohol_use_disorders | 4070.00 | 134.16 | 205.64 | 0.00 | 5.00 | 25.00 | 204.00 | 776.50 |
| self-harm | 4070.00 | 590.95 | 933.31 | 0.00 | 31.00 | 212.00 | 656.00 | 4564.62 |
| exposure_to_forces_of_nature | 4070.00 | 3.82 | 8.42 | 0.00 | 0.00 | 0.00 | 2.00 | 30.00 |
| diarrheal_diseases | 4070.00 | 838.16 | 2051.25 | 0.00 | 6.25 | 60.00 | 426.25 | 9836.88 |
| environmental_heat_and_cold_exposure | 4070.00 | 34.06 | 63.09 | 0.00 | 1.00 | 5.00 | 32.00 | 269.50 |
| neoplasms | 4070.00 | 6906.92 | 10548.30 | 1.00 | 295.25 | 2483.00 | 8220.00 | 49154.75 |
| conflict_and_terrorism | 4070.00 | 7.24 | 17.01 | 0.00 | 0.00 | 0.00 | 2.00 | 57.50 |
| diabetes_mellitus | 4070.00 | 1015.66 | 1419.24 | 1.00 | 91.25 | 481.50 | 1374.75 | 7031.00 |
| chronic_kidney_disease | 4070.00 | 902.29 | 1401.46 | 0.00 | 58.00 | 316.50 | 1070.75 | 7087.62 |
| poisonings | 4070.00 | 59.90 | 114.27 | 0.00 | 2.00 | 15.00 | 60.00 | 626.00 |
| protein-energy_malnutrition | 4070.00 | 222.79 | 559.27 | 0.00 | 2.00 | 12.00 | 126.00 | 2598.75 |
| road_injuries | 4070.00 | 1025.76 | 1734.52 | 0.00 | 45.00 | 395.50 | 1131.50 | 8326.00 |
| chronic_respiratory_diseases | 4070.00 | 1639.05 | 2528.57 | 1.00 | 81.00 | 592.50 | 2077.25 | 12690.88 |
| cirrhosis_and_other_chronic_liver_diseases | 4070.00 | 1043.32 | 1596.46 | 0.00 | 42.00 | 340.50 | 1476.75 | 8637.12 |
| digestive_diseases | 4070.00 | 1859.30 | 2719.22 | 0.00 | 85.25 | 674.50 | 2627.00 | 14774.00 |
| fire,_heat,_and_hot_substances | 4070.00 | 118.69 | 198.72 | 0.00 | 4.00 | 41.00 | 139.75 | 1099.50 |
| acute_hepatitis | 4070.00 | 32.77 | 76.06 | 0.00 | 1.00 | 4.00 | 21.00 | 397.00 |
| total_no_of_deaths | 4070.00 | 41609.82 | 61383.73 | 7.00 | 1888.25 | 18539.00 | 53836.25 | 385150.00 |
| meningitis_change | 4070.00 | -2.10 | 5.94 | -27.00 | -2.00 | 0.00 | 0.00 | 16.00 |
| cumulative_deaths | 4070.00 | 484839.98 | 654676.59 | 13.00 | 21194.50 | 177374.00 | 696965.25 | 2813393.00 |
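`handle_outliers_iqr` filters rows one column at a time, so the quartiles for each later column are computed on an already reduced frame and the result depends on column order. One alternative, sketched here under the assumption that order independence is wanted, computes every fence once on the unfiltered frame and then drops any row that violates at least one of them. `drop_iqr_rows` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_iqr_rows(df, k=1.5):
    """Drop rows where any numeric column lies outside its IQR fences.

    All fences are computed once, on the unfiltered frame.
    """
    num = df.select_dtypes(include=[np.number])
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on the columns.
    inside = (num >= q1 - k * iqr) & (num <= q3 + k * iqr)
    return df[inside.all(axis=1)]

demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000],
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
kept = drop_iqr_rows(demo)
print(f"Rows kept: {len(kept)} of {len(demo)}")
```

Either way, row removal discards whole country-year records, so the row count before and after is worth reporting alongside the statistics.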
# Calculate summary statistics before handling outliers
stats_before = df.describe().T[['mean', '50%', 'std']]
stats_before.columns = ['mean_before', 'median_before', 'std_before']
# Calculate summary statistics after handling outliers
stats_after = df_cleaned.describe().T[['mean', '50%', 'std']]
stats_after.columns = ['mean_after', 'median_after', 'std_after']
# Merge statistics before and after
stats_comparison = pd.concat([stats_before, stats_after], axis=1)
print(stats_comparison)
| | mean_before | median_before | std_before | mean_after | median_after | std_after |
|---|---|---|---|---|---|---|
| year | 2004.50 | 2004.50 | 8.66 | 2004.07 | 2004.00 | 8.65 |
| meningitis | 558.62 | 109.00 | 784.30 | 173.58 | 32.00 | 395.37 |
| alzheimers_disease_and_other_dementias | 1677.19 | 666.50 | 2105.45 | 921.02 | 275.00 | 1412.46 |
| parkinsons_disease | 412.24 | 164.00 | 513.54 | 221.39 | 71.00 | 336.87 |
| nutritional_deficiencies | 739.49 | 119.00 | 1085.38 | 246.26 | 17.00 | 612.31 |
| malaria | 245.05 | 0.00 | 403.45 | 104.94 | 0.00 | 275.86 |
| drowning | 468.94 | 177.00 | 574.04 | 203.85 | 66.00 | 341.15 |
| interpersonal_violence | 612.05 | 265.00 | 740.67 | 313.91 | 99.50 | 517.93 |
| maternal_disorders | 453.71 | 54.00 | 664.04 | 151.86 | 11.00 | 361.25 |
| hiv/aids | 1198.91 | 136.00 | 1790.61 | 494.25 | 37.00 | 1149.79 |
| drug_use_disorders | 81.66 | 20.00 | 110.52 | 47.15 | 9.00 | 81.68 |
| tuberculosis | 1919.56 | 417.00 | 2691.19 | 631.03 | 73.50 | 1409.98 |
| cardiovascular_diseases | 28583.10 | 11742.00 | 35091.24 | 14155.90 | 5170.00 | 21033.66 |
| lower_respiratory_infections | 6646.39 | 2126.50 | 8377.32 | 2273.97 | 678.00 | 4004.72 |
| neonatal_disorders | 4667.30 | 916.00 | 6560.24 | 1509.61 | 242.00 | 3258.89 |
| alcohol_use_disorders | 206.37 | 80.00 | 260.79 | 134.16 | 25.00 | 205.64 |
| self-harm | 1256.36 | 533.00 | 1543.21 | 590.95 | 212.00 | 933.31 |
| exposure_to_forces_of_nature | 7.44 | 0.00 | 11.56 | 3.82 | 0.00 | 8.42 |
| diarrheal_diseases | 2518.44 | 296.50 | 3707.42 | 838.16 | 60.00 | 2051.25 |
| environmental_heat_and_cold_exposure | 66.75 | 21.00 | 86.90 | 34.06 | 5.00 | 63.09 |
| neoplasms | 13440.43 | 5629.50 | 16817.80 | 6906.92 | 2483.00 | 10548.30 |
| conflict_and_terrorism | 14.63 | 0.00 | 22.94 | 7.24 | 0.00 | 17.01 |
| diabetes_mellitus | 2103.89 | 1087.00 | 2417.35 | 1015.66 | 481.50 | 1419.24 |
| chronic_kidney_disease | 1988.02 | 822.00 | 2421.58 | 902.29 | 316.50 | 1401.46 |
| poisonings | 161.24 | 52.50 | 207.59 | 59.90 | 15.00 | 114.27 |
| protein-energy_malnutrition | 659.48 | 92.00 | 982.64 | 222.79 | 12.00 | 559.27 |
| road_injuries | 2380.77 | 966.50 | 2912.78 | 1025.76 | 395.50 | 1734.52 |
| chronic_respiratory_diseases | 3656.55 | 1689.00 | 4472.84 | 1639.05 | 592.50 | 2528.57 |
| cirrhosis_and_other_chronic_liver_diseases | 2470.60 | 1210.00 | 2958.07 | 1043.32 | 340.50 | 1596.46 |
| digestive_diseases | 4303.08 | 2185.00 | 5049.07 | 1859.30 | 674.50 | 2719.22 |
| fire,_heat,_and_hot_substances | 284.92 | 126.00 | 350.75 | 118.69 | 41.00 | 198.72 |
| acute_hepatitis | 101.45 | 15.00 | 143.79 | 32.77 | 4.00 | 76.06 |
| total_no_of_deaths | 107820.60 | 50257.50 | 128801.21 | 41609.82 | 18539.00 | 61383.73 |
| meningitis_change | -21.16 | -1.00 | 919.88 | -2.10 | 0.00 | 5.94 |
| cumulative_deaths | 1483344.83 | 553431.50 | 1879915.84 | 484839.98 | 177374.00 | 654676.59 |
Modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Define features and target
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Linear Regression to predict future trends:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 494300444.4112894
R-squared: 0.9708188257223398
Random Forest:
from sklearn.ensemble import RandomForestRegressor
# Initialize and train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Make predictions
rf_y_pred = rf_model.predict(X_test)
# Evaluate the model
rf_mse = mean_squared_error(y_test, rf_y_pred)
rf_r2 = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {rf_mse}")
print(f"Random Forest R-squared: {rf_r2}")
Random Forest Mean Squared Error: 131796235.34221615
Random Forest R-squared: 0.9922193699072206
Cross-Validation
Use cross-validation to assess model performance more reliably:
from sklearn.model_selection import cross_val_score
# Linear Regression with cross-validation
lr_model = LinearRegression()
lr_scores = cross_val_score(lr_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Linear Regression Mean Cross-Validated MSE: {-lr_scores.mean()}")
# Random Forest with cross-validation
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='neg_mean_squared_error')
print(f"Random Forest Mean Cross-Validated MSE: {-rf_scores.mean()}")
Linear Regression Mean Cross-Validated MSE: 679392109.7019784
Random Forest Mean Cross-Validated MSE: 637366769.3314102
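The cross-validation call above runs on the raw `X`, while the train/test evaluation used standardized features. Wrapping the scaler and the model in one pipeline lets `cross_val_score` refit the scaler inside each fold, so both evaluations see the same preprocessing and no test-fold statistics reach the scaler. This is a minimal sketch on synthetic data. The shuffled `KFold` is an assumption that suits rows with no meaningful order; it is not something the notebook specifies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical data, known coefficients).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

# The scaler is fitted on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(f"Mean cross-validated MSE: {cv_mse:.4f}")
```

For tree ensembles such as the random forest, scaling has little effect, but the same pipeline pattern keeps the two models' evaluations comparable.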
Model Diagnostics
Examine residuals for Linear Regression to ensure they are randomly dispersed:
import matplotlib.pyplot as plt
import seaborn as sns
# Train Linear Regression Model
lr_model.fit(X_train, y_train)
y_train_pred = lr_model.predict(X_train)
residuals = y_train - y_train_pred
# Plot residuals
plt.figure(figsize=(10, 6))
sns.scatterplot(x=y_train_pred, y=residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Predicted Values')
plt.show()
Feature Importance (Random Forest)
Evaluate feature importance to understand which features impact the model predictions:
import pandas as pd
# Train Random Forest Model
rf_model.fit(X_train, y_train)
# Get feature importances
importances = rf_model.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 8))
feature_importances.plot(kind='bar')
plt.title('Feature Importances from Random Forest')
plt.show()
Evaluate on Test Data
# Linear Regression Evaluation
lr_model.fit(X_train, y_train)
y_test_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_test_pred_lr)
r2_lr = r2_score(y_test, y_test_pred_lr)
print(f"Linear Regression Mean Squared Error: {mse_lr}")
print(f"Linear Regression R-squared: {r2_lr}")
# Random Forest Evaluation
rf_y_pred = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, rf_y_pred)
r2_rf = r2_score(y_test, rf_y_pred)
print(f"Random Forest Mean Squared Error: {mse_rf}")
print(f"Random Forest R-squared: {r2_rf}")
Linear Regression Mean Squared Error: 494300444.4112894
Linear Regression R-squared: 0.9708188257223398
Random Forest Mean Squared Error: 131796235.34221615
Random Forest R-squared: 0.9922193699072206
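The MSE values above are in squared deaths, which makes them hard to compare with the scale of `total_no_of_deaths`. Taking the square root gives the RMSE, expressed in the same unit as the target. The short calculation below uses the two test-set MSE values printed above.

```python
import math

# Test-set MSE values copied from the output above.
mse_by_model = {
    "Linear Regression": 494300444.4112894,
    "Random Forest": 131796235.34221615,
}

# RMSE = sqrt(MSE), in the same unit as the target (deaths).
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}
for name, rmse in rmse_by_model.items():
    print(f"{name} RMSE: {rmse:,.0f} deaths")
```

An RMSE in the low tens of thousands can then be read directly against the target's reported spread.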
### Check for outliers:
# Select only numeric columns for analysis
numeric_df = df.select_dtypes(include=[np.number])
# Calculate Q1 (25th percentile) and Q3 (75th percentile) for numeric columns
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
# Calculate the IQR
IQR = Q3 - Q1
# Detect outliers for each column (values outside of 1.5 * IQR)
outliers = (numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))
# Count the number of outliers for each column
outliers_count = outliers.sum()
print(outliers_count)
year    0
meningitis    1029
alzheimers_disease_and_other_dementias    819
parkinsons_disease    811
nutritional_deficiencies    950
malaria    1278
drowning    733
interpersonal_violence    841
maternal_disorders    789
hiv/aids    1041
drug_use_disorders    725
tuberculosis    916
cardiovascular_diseases    732
lower_respiratory_infections    593
neonatal_disorders    777
alcohol_use_disorders    685
self-harm    722
exposure_to_forces_of_nature    1025
diarrheal_diseases    926
environmental_heat_and_cold_exposure    559
neoplasms    768
conflict_and_terrorism    1188
diabetes_mellitus    872
chronic_kidney_disease    787
poisonings    580
protein-energy_malnutrition    994
road_injuries    765
chronic_respiratory_diseases    918
cirrhosis_and_other_chronic_liver_diseases    796
digestive_diseases    812
fire,_heat,_and_hot_substances    562
acute_hepatitis    802
total_no_of_deaths    712
meningitis_change    1508
cumulative_deaths    693
dtype: int64
# List of columns to plot
columns_to_plot = ['malaria', 'conflict_and_terrorism', 'diabetes_mellitus', 'cardiovascular_diseases']
# Plotting the boxplots for selected columns
plt.figure(figsize=(14, 8))
for i, col in enumerate(columns_to_plot, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=numeric_df[col])
    plt.title(f'Boxplot for {col}')
plt.tight_layout()
plt.show()
# Cap/floor outliers at the 5th and 95th percentile
lower_bound = numeric_df.quantile(0.05)
upper_bound = numeric_df.quantile(0.95)
# Applying the capping
capped_df = numeric_df.clip(lower=lower_bound, upper=upper_bound, axis=1)
# Check the capped dataframe
capped_df.head()
| | year | meningitis | alzheimers_disease_and_other_dementias | parkinsons_disease | nutritional_deficiencies | malaria | drowning | interpersonal_violence | maternal_disorders | hiv/aids | ... | protein-energy_malnutrition | road_injuries | chronic_respiratory_diseases | cirrhosis_and_other_chronic_liver_diseases | digestive_diseases | fire,_heat,_and_hot_substances | acute_hepatitis | total_no_of_deaths | meningitis_change | cumulative_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1991 | 2159.00 | 1116.00 | 371.00 | 2087.00 | 93.00 | 1370.00 | 1538.00 | 2655.00 | 34.00 | ... | 2054.00 | 4154.00 | 5945.00 | 2673.00 | 5005.00 | 323.00 | 1569.05 | 147971.00 | 0.00 | 147971.00 |
| 1 | 1991 | 2218.00 | 1136.00 | 374.00 | 2153.00 | 189.00 | 1391.00 | 2001.00 | 2885.00 | 41.00 | ... | 2119.00 | 4472.00 | 6050.00 | 2728.00 | 5120.00 | 332.00 | 1569.05 | 156844.00 | 59.00 | 304815.00 |
| 2 | 1992 | 2475.00 | 1162.00 | 378.00 | 2441.00 | 239.00 | 1514.00 | 2299.00 | 3315.00 | 48.00 | ... | 2404.00 | 5106.00 | 6223.00 | 2830.00 | 5335.00 | 360.00 | 1569.05 | 169156.00 | 65.00 | 473971.00 |
| 3 | 1993 | 2812.00 | 1187.00 | 384.00 | 2837.00 | 108.00 | 1687.00 | 2589.00 | 3636.15 | 56.00 | ... | 2797.00 | 5681.00 | 6445.00 | 2943.00 | 5568.00 | 396.00 | 1569.05 | 182230.00 | 65.00 | 656201.00 |
| 4 | 1994 | 3027.00 | 1211.00 | 391.00 | 3081.00 | 211.00 | 1809.00 | 2849.00 | 3636.15 | 63.00 | ... | 3038.00 | 6001.00 | 6664.00 | 3027.00 | 5739.00 | 420.00 | 1569.05 | 194795.00 | 65.00 | 850996.00 |
5 rows × 35 columns
# Recalculate Q1 (25th percentile) and Q3 (75th percentile) for capped data
Q1_capped = capped_df.quantile(0.25)
Q3_capped = capped_df.quantile(0.75)
# Calculate the IQR for the capped data
IQR_capped = Q3_capped - Q1_capped
# Detect outliers for each column in the capped data (values outside of 1.5 * IQR)
outliers_capped = (capped_df < (Q1_capped - 1.5 * IQR_capped)) | (capped_df > (Q3_capped + 1.5 * IQR_capped))
# Count the number of outliers for each column in the capped data
outliers_capped_count = outliers_capped.sum()
print(outliers_capped_count)
year    0
meningitis    1029
alzheimers_disease_and_other_dementias    819
parkinsons_disease    811
nutritional_deficiencies    950
malaria    1278
drowning    733
interpersonal_violence    841
maternal_disorders    789
hiv/aids    1041
drug_use_disorders    725
tuberculosis    916
cardiovascular_diseases    732
lower_respiratory_infections    593
neonatal_disorders    777
alcohol_use_disorders    685
self-harm    722
exposure_to_forces_of_nature    1025
diarrheal_diseases    926
environmental_heat_and_cold_exposure    559
neoplasms    768
conflict_and_terrorism    1188
diabetes_mellitus    872
chronic_kidney_disease    787
poisonings    580
protein-energy_malnutrition    994
road_injuries    765
chronic_respiratory_diseases    918
cirrhosis_and_other_chronic_liver_diseases    796
digestive_diseases    812
fire,_heat,_and_hot_substances    562
acute_hepatitis    802
total_no_of_deaths    712
meningitis_change    1508
cumulative_deaths    693
dtype: int64
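The counts after capping match the counts before capping. That is what the IQR rule produces whenever more than 5% of a column lies beyond a fence: capping at the 5th and 95th percentiles leaves Q1 and Q3 untouched, so the fences do not move, and the capped values (now equal to the 95th percentile) still sit beyond them. The sketch below reproduces the effect on a synthetic right-skewed column; the data and the helper `count_iqr_outliers` are hypothetical.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(s, k=1.5):
    """Number of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# Heavily right-skewed synthetic column (hypothetical data).
rng = np.random.default_rng(1)
s = pd.Series(rng.lognormal(mean=3.0, sigma=2.0, size=2000))

# Percentile capping, as in the cell above.
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

before = count_iqr_outliers(s)
after = count_iqr_outliers(capped)
print(f"before: {before}, after: {after}")
print(f"quartiles unchanged: {s.quantile(0.75) == capped.quantile(0.75)}")
```

To bring the IQR count down, the cap has to sit at or inside the fences themselves, or the column has to be transformed so that the fences widen relative to the data.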
# Recalculate the 1st and 99th percentiles for more aggressive capping
lower_bound_strict = numeric_df.quantile(0.01)
upper_bound_strict = numeric_df.quantile(0.99)
# Apply stricter capping to the dataframe
capped_df_strict = numeric_df.clip(lower=lower_bound_strict, upper=upper_bound_strict, axis=1)
# Check the capped dataframe
print(capped_df_strict.head())
   year  meningitis  alzheimers_disease_and_other_dementias  \
0  1990     2159.00                                 1116.00
1  1991     2218.00                                 1136.00
2  1992     2475.00                                 1162.00
3  1993     2812.00                                 1187.00
4  1994     3027.00                                 1211.00

   parkinsons_disease  nutritional_deficiencies  malaria  drowning  \
0              371.00                   2087.00    93.00   1370.00
1              374.00                   2153.00   189.00   1391.00
2              378.00                   2441.00   239.00   1514.00
3              384.00                   2837.00   108.00   1687.00
4              391.00                   3081.00   211.00   1809.00

   interpersonal_violence  maternal_disorders  hiv/aids  ...  \
0                 1538.00             2655.00     34.00  ...
1                 2001.00             2885.00     41.00  ...
2                 2299.00             3315.00     48.00  ...
3                 2589.00             3671.00     56.00  ...
4                 2849.00             3863.00     63.00  ...

   protein-energy_malnutrition  road_injuries  chronic_respiratory_diseases  \
0                      2054.00        4154.00                       5945.00
1                      2119.00        4472.00                       6050.00
2                      2404.00        5106.00                       6223.00
3                      2797.00        5681.00                       6445.00
4                      3038.00        6001.00                       6664.00

   cirrhosis_and_other_chronic_liver_diseases  digestive_diseases  \
0                                     2673.00             5005.00
1                                     2728.00             5120.00
2                                     2830.00             5335.00
3                                     2943.00             5568.00
4                                     3027.00             5739.00

   fire,_heat,_and_hot_substances  acute_hepatitis  total_no_of_deaths  \
0                          323.00          2985.00           147971.00
1                          332.00          3092.00           156844.00
2                          360.00          3325.00           169156.00
3                          396.00          3601.00           182230.00
4                          420.00          3816.00           194795.00

   meningitis_change  cumulative_deaths
0               0.00          147971.00
1              59.00          304815.00
2             257.00          473971.00
3             337.00          656201.00
4             215.00          850996.00

[5 rows x 35 columns]
# Recalculate Q1 and Q3 for capped data (after stricter capping)
Q1_strict = capped_df_strict.quantile(0.25)
Q3_strict = capped_df_strict.quantile(0.75)
# Calculate the IQR for the capped data
IQR_strict = Q3_strict - Q1_strict
# Detect outliers for each column (values outside of 1.5 * IQR)
outliers_strict = (capped_df_strict < (Q1_strict - 1.5 * IQR_strict)) | (capped_df_strict > (Q3_strict + 1.5 * IQR_strict))
# Count the number of outliers for each column in the strictly capped data
outliers_strict_count = outliers_strict.sum()
print(outliers_strict_count)
year    0
meningitis    1029
alzheimers_disease_and_other_dementias    819
parkinsons_disease    811
nutritional_deficiencies    950
malaria    1278
drowning    733
interpersonal_violence    841
maternal_disorders    789
hiv/aids    1041
drug_use_disorders    725
tuberculosis    916
cardiovascular_diseases    732
lower_respiratory_infections    593
neonatal_disorders    777
alcohol_use_disorders    685
self-harm    722
exposure_to_forces_of_nature    1025
diarrheal_diseases    926
environmental_heat_and_cold_exposure    559
neoplasms    768
conflict_and_terrorism    1188
diabetes_mellitus    872
chronic_kidney_disease    787
poisonings    580
protein-energy_malnutrition    994
road_injuries    765
chronic_respiratory_diseases    918
cirrhosis_and_other_chronic_liver_diseases    796
digestive_diseases    812
fire,_heat,_and_hot_substances    562
acute_hepatitis    802
total_no_of_deaths    712
meningitis_change    1508
cumulative_deaths    693
dtype: int64
# Apply log(x + 1) only to columns whose values are all strictly positive;
# columns that contain any zero or negative value are left unchanged
log_transformed_df = capped_df_strict.apply(lambda x: np.log1p(x) if (x > 0).all() else x)
# Recalculate Q1 and Q3 after log transformation
Q1_log = log_transformed_df.quantile(0.25)
Q3_log = log_transformed_df.quantile(0.75)
# Calculate IQR after log transformation
IQR_log = Q3_log - Q1_log
# Detect outliers after log transformation
outliers_log = (log_transformed_df < (Q1_log - 1.5 * IQR_log)) | (log_transformed_df > (Q3_log + 1.5 * IQR_log))
# Count the number of outliers after log transformation
outliers_log_count = outliers_log.sum()
print(outliers_log_count)
year                                             0
meningitis                                    1029
alzheimers_disease_and_other_dementias         819
parkinsons_disease                             811
nutritional_deficiencies                       950
malaria                                       1278
drowning                                       733
interpersonal_violence                         841
maternal_disorders                             789
hiv/aids                                      1041
drug_use_disorders                             725
tuberculosis                                   916
cardiovascular_diseases                          0
lower_respiratory_infections                     0
neonatal_disorders                             777
alcohol_use_disorders                          685
self-harm                                        0
exposure_to_forces_of_nature                  1025
diarrheal_diseases                             926
environmental_heat_and_cold_exposure           559
neoplasms                                        0
conflict_and_terrorism                        1188
diabetes_mellitus                              124
chronic_kidney_disease                           0
poisonings                                     580
protein-energy_malnutrition                    994
road_injuries                                    0
chronic_respiratory_diseases                     0
cirrhosis_and_other_chronic_liver_diseases       0
digestive_diseases                               0
fire,_heat,_and_hot_substances                 562
acute_hepatitis                                802
total_no_of_deaths                              62
meningitis_change                             1508
cumulative_deaths                               72
dtype: int64
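Many of the counts above are identical before and after the transformation. One likely reason, assuming those columns contain zeros, is that the `(x > 0).all()` check skips any column with at least one zero, so that column is never transformed. The sketch below uses made-up numbers rather than the project data and a hypothetical helper, `log1p_nonnegative`, that applies `np.log1p` to every column whose values are all non-negative.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one column with a zero, one strictly positive,
# and one with negative values.
toy = pd.DataFrame({
    "has_zero": [0.0, 3.0, 10.0, 2000.0],
    "all_positive": [1.0, 5.0, 50.0, 5000.0],
    "has_negative": [-4.0, 0.0, 2.0, 9.0],
})

def log1p_nonnegative(frame):
    """Apply log(1 + x) to columns whose values are all >= 0; leave the rest unchanged."""
    return frame.apply(lambda col: np.log1p(col) if (col >= 0).all() else col)

toy_log = log1p_nonnegative(toy)
print(toy_log.round(3))
```

Columns with negative values, such as a year-over-year change, are left untouched here; whether they should be transformed at all is a separate modelling decision.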
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize outliers using boxplots
plt.figure(figsize=(15, 10))
sns.boxplot(data=df[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease',
'nutritional_deficiencies', 'malaria', 'drowning', 'interpersonal_violence']])
plt.xticks(rotation=90)
plt.show()
import numpy as np
# Identify all numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
# Visualize outliers for each numeric column using boxplot
for col in numeric_cols:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()
# Handling outliers by capping at 1.5 * IQR for all numeric columns
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
# Compute summary statistics before this capping pass
# (note: if the previous capping loop has already run on df, these "before" values are themselves capped)
stats_before = df.describe()
# Keep a copy of the dataframe as it was before this pass
df_before = df.copy()
# Handle outliers by capping using IQR
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
# Compute summary statistics after handling outliers
stats_after = df.describe()
# Compare before and after statistics
stats_comparison = pd.concat([stats_before, stats_after], axis=1, keys=['Before', 'After'])
print(stats_comparison)
Before \
year meningitis alzheimers_disease_and_other_dementias
count 6120.00 6120.00 6120.00
mean 2004.50 558.62 1677.19
std 8.66 784.30 2105.45
min 1990.00 0.00 0.00
25% 1997.00 15.00 90.00
50% 2004.50 109.00 666.50
75% 2012.00 847.25 2456.25
max 2019.00 2095.62 6005.62
\
parkinsons_disease nutritional_deficiencies malaria drowning
count 6120.00 6120.00 6120.00 6120.00
mean 412.24 739.49 245.05 468.94
std 513.54 1085.38 403.45 574.04
min 0.00 0.00 0.00 0.00
25% 27.00 9.00 0.00 34.00
50% 164.00 119.00 0.00 177.00
75% 609.25 1167.25 393.00 698.00
max 1482.62 2904.62 982.50 1694.00
... \
interpersonal_violence maternal_disorders hiv/aids ...
count 6120.00 6120.00 6120.00 ...
mean 612.05 453.71 1198.91 ...
std 740.67 664.04 1790.61 ...
min 0.00 0.00 0.00 ...
25% 40.00 5.00 11.00 ...
50% 265.00 54.00 136.00 ...
75% 877.00 734.00 1879.00 ...
max 2132.50 1827.50 4681.00 ...
After \
protein-energy_malnutrition road_injuries chronic_respiratory_diseases
count 6120.00 6120.00 6120.00
mean 659.48 2380.77 3656.55
std 982.64 2912.78 4472.84
min 0.00 0.00 1.00
25% 5.00 174.75 289.00
50% 92.00 966.50 1689.00
75% 1042.50 3435.25 5249.75
max 2598.75 8326.00 12690.88
\
cirrhosis_and_other_chronic_liver_diseases digestive_diseases
count 6120.00 6120.00
mean 2470.60 4303.08
std 2958.07 5049.07
min 0.00 0.00
25% 154.00 284.00
50% 1210.00 2185.00
75% 3547.25 6080.00
max 8637.12 14774.00
\
fire,_heat,_and_hot_substances acute_hepatitis total_no_of_deaths
count 6120.00 6120.00 6120.00
mean 284.92 101.45 107820.60
std 350.75 143.79 128801.21
min 0.00 0.00 7.00
25% 17.00 2.00 6935.00
50% 126.00 15.00 50257.50
75% 450.00 160.00 158221.00
max 1099.50 397.00 385150.00
meningitis_change cumulative_deaths
count 6120.00 6120.00
mean -5.15 1483344.83
std 12.56 1879915.84
min -27.50 13.00
25% -11.00 71995.25
50% -1.00 553431.50
75% 0.00 2266613.50
max 16.50 5558540.88
[8 rows x 70 columns]
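In the comparison above, the "Before" maxima already sit at the IQR fences (for meningitis, 2095.62 equals 847.25 + 1.5 × (847.25 - 15)), which suggests the data had been capped before `stats_before` was taken. A minimal sketch of the intended order, on toy data and with a hypothetical helper `cap_iqr`, takes the snapshot first and caps a copy:

```python
import pandas as pd

def cap_iqr(frame, k=1.5):
    """Return a copy with every numeric column clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = frame.copy()
    for col in capped.select_dtypes(include="number").columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        capped[col] = capped[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return capped

# Toy data (hypothetical): one extreme value in each column
raw = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 500]}, dtype=float)

stats_before = raw.describe()   # snapshot of the untouched data
capped = cap_iqr(raw)           # raw itself is not modified
stats_after = capped.describe()

comparison = pd.concat([stats_before, stats_after], axis=1, keys=["Before", "After"])
print(comparison.loc[["mean", "max"]])
```

Returning a capped copy instead of modifying the frame in place keeps the original values available for later comparisons.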
Robust Scaling
Use RobustScaler to scale features in a way that's robust to outliers.
from sklearn.preprocessing import RobustScaler
# Initialize RobustScaler
scaler = RobustScaler()
# Fit and transform the numeric columns
numeric_df_scaled = pd.DataFrame(scaler.fit_transform(numeric_df), columns=numeric_df.columns)
# Check the results
print(numeric_df_scaled.describe())
year meningitis alzheimers_disease_and_other_dementias \
count 6120.00 6120.00 6120.00
mean 0.00 1.94 1.77
std 0.58 8.02 7.70
min -0.97 -0.13 -0.28
25% -0.50 -0.11 -0.24
50% 0.00 0.00 0.00
75% 0.50 0.89 0.76
max 0.97 118.05 135.26
parkinsons_disease nutritional_deficiencies malaria drowning \
count 6120.00 6120.00 6120.00 6120.00
mean 1.73 1.84 10.54 2.27
std 7.93 9.05 46.89 13.37
min -0.28 -0.10 0.00 -0.27
25% -0.24 -0.09 0.00 -0.22
50% 0.00 0.00 0.00 0.00
75% 0.76 0.91 1.00 0.78
max 131.95 231.47 714.01 231.32
interpersonal_violence maternal_disorders hiv/aids ... \
count 6120.00 6120.00 6120.00 ...
mean 2.17 1.66 3.11 ...
std 8.26 8.31 11.25 ...
min -0.32 -0.07 -0.07 ...
25% -0.27 -0.07 -0.07 ...
50% 0.00 0.00 0.00 ...
75% 0.73 0.93 0.93 ...
max 82.89 147.98 163.47 ...
protein-energy_malnutrition road_injuries \
count 6120.00 6120.00
mean 1.81 1.52
std 7.96 7.39
min -0.09 -0.30
25% -0.08 -0.24
50% 0.00 0.00
75% 0.92 0.76
max 194.84 100.68
chronic_respiratory_diseases \
count 6120.00
mean 3.11
std 21.20
min -0.34
25% -0.28
50% 0.00
75% 0.72
max 275.03
cirrhosis_and_other_chronic_liver_diseases digestive_diseases \
count 6120.00 6120.00
mean 1.45 1.47
std 6.10 6.42
min -0.36 -0.38
25% -0.31 -0.33
50% 0.00 0.00
75% 0.69 0.67
max 79.22 79.84
fire,_heat,_and_hot_substances acute_hepatitis total_no_of_deaths \
count 6120.00 6120.00 6120.00
mean 1.07 3.82 1.25
std 4.92 26.49 5.78
min -0.29 -0.09 -0.33
25% -0.25 -0.08 -0.29
50% 0.00 0.00 0.00
75% 0.75 0.92 0.71
max 59.47 406.90 68.69
meningitis_change cumulative_deaths
count 6120.00 6120.00
mean -1.83 1.42
std 83.63 7.05
min -975.18 -0.25
25% -0.91 -0.22
50% 0.00 0.00
75% 0.09 0.78
max 4848.55 120.68
[8 rows x 35 columns]
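In the scaled summary above, every median is 0 and the gap between the 25% and 75% rows is 1, yet the maxima remain very large. That is expected: with default settings, `RobustScaler` subtracts the median and divides by the interquartile range, and it does not clip anything. The sketch below checks this on a toy column (values are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (hypothetical values) with one extreme observation
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(x)

# Manual equivalent of the default settings: (x - median) / (Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Robust scaling therefore makes columns comparable in scale, but extreme values stay extreme; it complements capping and does not replace it.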
Explore Different Thresholds
Apply capping at different percentiles (e.g., 95th percentile) to see if it’s a better fit.
# Capping at the 95th percentile
upper_cap_95 = numeric_df.quantile(0.95)
numeric_df_capped_95 = numeric_df.clip(upper=upper_cap_95, axis=1)
# Visualizing the capped data at the 95th percentile
plt.figure(figsize=(15, 10))
sns.boxplot(data=numeric_df_capped_95[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease']])
plt.title("After 95th Percentile Capping")
plt.xticks(rotation=90)
plt.show()
# Summary statistics after 95th percentile capping
print("\nAfter 95th Percentile Capping:")
print(numeric_df_capped_95[['meningitis', 'alzheimers_disease_and_other_dementias', 'parkinsons_disease']].describe())
After 95th Percentile Capping:
meningitis alzheimers_disease_and_other_dementias parkinsons_disease
count 6120.00 6120.00 6120.00
mean 918.75 2791.50 665.43
std 1654.04 5128.79 1192.28
min 0.00 0.00 0.00
25% 15.00 90.00 27.00
50% 109.00 666.50 164.00
75% 847.25 2456.25 609.25
max 6110.10 20386.30 4707.15
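To compare thresholds rather than inspect only the 95th percentile, the same clipping can be looped over several quantiles. The following is a sketch on synthetic right-skewed data; the column name `deaths` and the list of quantiles are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column standing in for a cause-of-death count
toy = pd.DataFrame({"deaths": rng.lognormal(mean=5.0, sigma=2.0, size=1000)})

rows = []
for q in (0.90, 0.95, 0.99):
    cap = toy["deaths"].quantile(q)
    capped = toy["deaths"].clip(upper=cap)
    rows.append({
        "quantile": q,
        "cap": cap,
        "share_capped": (toy["deaths"] > cap).mean(),  # fraction of rows altered
        "mean_after": capped.mean(),
    })

summary = pd.DataFrame(rows)
print(summary)
```

Looking at the cap value, the share of rows altered, and the resulting mean side by side makes the trade-off between thresholds easier to judge than a single boxplot.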
Visualize More Columns
Visualize additional columns to ensure comprehensive outlier handling.
# Visualize more columns
plt.figure(figsize=(20, 15))
sns.boxplot(data=numeric_df[['malaria', 'drowning', 'interpersonal_violence', 'maternal_disorders',
'hiv/aids', 'drug_use_disorders', 'tuberculosis', 'cardiovascular_diseases',
'lower_respiratory_infections', 'neonatal_disorders']])
plt.xticks(rotation=90)
plt.title("Boxplots of Additional Columns")
plt.show()
Random Forest Regressor model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Assuming you want to predict total number of deaths
X = numeric_df_capped_95.drop(columns=['total_no_of_deaths'])
y = numeric_df_capped_95['total_no_of_deaths']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a Random Forest model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
Mean Squared Error: 289725277.01695436
R^2 Score: 0.9938607881978867
This shows which variables are most influential in predicting the target.
importances = model.feature_importances_
feature_names = X.columns
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
print(feature_importance_df.sort_values(by='importance', ascending=False))
                                       feature  importance
29                          digestive_diseases        0.87
30              fire,_heat,_and_hot_substances        0.06
12                     cardiovascular_diseases        0.02
24                                  poisonings        0.01
22                           diabetes_mellitus        0.01
16                                   self-harm        0.01
6                                     drowning        0.01
27                chronic_respiratory_diseases        0.00
14                          neonatal_disorders        0.00
13                lower_respiratory_infections        0.00
9                                     hiv/aids        0.00
7                       interpersonal_violence        0.00
20                                   neoplasms        0.00
18                          diarrheal_diseases        0.00
23                      chronic_kidney_disease        0.00
33                           cumulative_deaths        0.00
28  cirrhosis_and_other_chronic_liver_diseases        0.00
19        environmental_heat_and_cold_exposure        0.00
11                                tuberculosis        0.00
25                 protein-energy_malnutrition        0.00
15                       alcohol_use_disorders        0.00
4                     nutritional_deficiencies        0.00
5                                      malaria        0.00
26                               road_injuries        0.00
0                                         year        0.00
1                                   meningitis        0.00
31                             acute_hepatitis        0.00
10                          drug_use_disorders        0.00
2       alzheimers_disease_and_other_dementias        0.00
3                           parkinsons_disease        0.00
8                           maternal_disorders        0.00
21                      conflict_and_terrorism        0.00
32                           meningitis_change        0.00
17                exposure_to_forces_of_nature        0.00
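Impurity-based importances from a random forest can concentrate on one of several correlated predictors, which may be part of the reason a single cause dominates the ranking above. Permutation importance on held-out data is one possible cross-check. The sketch below uses synthetic data, and the feature names `f1`, `f2`, `f3` are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
# Synthetic features: f1 and f2 drive the target, f3 is pure noise
X = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
y = 3.0 * X["f1"] + 1.0 * X["f2"] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

# Shuffle each feature on the test set and measure the resulting drop in R²
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
perm_df = pd.DataFrame({
    "feature": X.columns,
    "importance": result.importances_mean,
}).sort_values(by="importance", ascending=False)
print(perm_df)
```

Permutation importance is also affected by correlated features, so it is best read alongside the impurity-based ranking rather than as a replacement for it.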
Cross-Validation:
To ensure that the model is robust, use cross-validation to evaluate its performance across multiple subsets of the data.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f'Cross-validated R² scores: {cv_scores}')
print(f'Mean CV R² score: {cv_scores.mean()}')
Cross-validated R² scores: [0.98481804 0.91487192 0.95665849 0.977272 0.9252078 ]
Mean CV R² score: 0.9517656470976534
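Each row in this dataset is a country-year. A random train/test split places years from the same country on both sides, whereas `cv=5` without shuffling holds out contiguous blocks of rows, which may explain why the cross-validated scores are lower than the single-split score. To make the grouping explicit, `GroupKFold` can be given the country as the group label. The sketch below uses synthetic data, and the `groups` array is a hypothetical stand-in for a country column that is not part of `X` in the original.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_countries, n_years = 20, 10
# Hypothetical group labels: one id per country, repeated for every year
groups = np.repeat(np.arange(n_countries), n_years)
X = rng.normal(size=(n_countries * n_years, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=len(X))

cv = GroupKFold(n_splits=5)
model = RandomForestRegressor(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, groups=groups, cv=cv, scoring="r2")
print(scores, scores.mean())

# Check that no country appears on both sides of any split
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

Scores obtained this way estimate how well the model generalizes to countries it has not seen, which is a stricter test than predicting held-out years of known countries.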
Log Transformation (if the target is skewed):
If the target variable (total_no_of_deaths) is highly skewed, applying a log transformation can help reduce the impact of outliers.
import numpy as np
y_log = np.log1p(y) # Apply log transformation
Feature Selection:
Remove features with zero or negligible importance and re-run the model. This will reduce the dimensionality of the dataset and focus on the most impactful features.
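# Dropping features with negligible importance
# (the 0.00 values printed above are rounded, so an exact == 0 test would rarely match; 0.005 is an assumed cutoff)
low_importance_features = feature_importance_df[feature_importance_df['importance'] < 0.005]['feature']
X_reduced = X.drop(columns=low_importance_features)
# Re-run train-test split and model (the target is now y_log, so the metrics below are on the log scale)
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_log, test_size=0.2, random_state=42)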
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
# Predictions and evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R^2 Score: {r2}')
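Because the model above is fit on `y_log`, its MSE is in squared log units and is not comparable with the earlier MSE on raw counts. One way to keep the log transform while scoring on the original scale is scikit-learn's `TransformedTargetRegressor`. The following is a sketch on synthetic data, not a rerun of the notebook's model.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(500, 2))
# Synthetic, right-skewed, strictly positive target
y = np.exp(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The wrapper fits the regressor on log1p(y) and applies expm1 to its predictions
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # already back on the original scale

print(f"MSE (original units): {mean_squared_error(y_test, y_pred):.2f}")
print(f"R^2 (original units): {r2_score(y_test, y_pred):.3f}")
```

With this wrapper the inverse transform cannot be forgotten at prediction time, and metrics from log-target and raw-target models can be compared directly.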
numeric_df.columns
df.head()
Observations: China, India, and the USA bear the largest share of deaths due to disease worldwide. Cardiovascular diseases, neoplasms (malignancy/cancer), and lower respiratory tract infections (for example, pneumonia) are the top three killer diseases in the world.
import numpy as np
# Apply log transformation to relevant columns
transformed_df = df.copy()
columns_to_transform = ['meningitis', 'alzheimers_disease_and_other_dementias',
'parkinsons_disease', 'nutritional_deficiencies',
'malaria', 'drowning', 'interpersonal_violence',
'maternal_disorders', 'hiv/aids', 'chronic_kidney_disease',
'poisonings', 'protein-energy_malnutrition', 'road_injuries',
'chronic_respiratory_diseases', 'cirrhosis_and_other_chronic_liver_diseases',
'digestive_diseases', 'fire,_heat,_and_hot_substances', 'acute_hepatitis']
for col in columns_to_transform:
    transformed_df[col] = np.log1p(transformed_df[col])  # log1p is used to handle zero values
transformed_df.describe()
| | year | meningitis | alzheimers_disease_and_other_dementias | parkinsons_disease | nutritional_deficiencies | malaria | drowning | interpersonal_violence | maternal_disorders | hiv/aids | ... | protein-energy_malnutrition | road_injuries | chronic_respiratory_diseases | cirrhosis_and_other_chronic_liver_diseases | digestive_diseases | fire,_heat,_and_hot_substances | acute_hepatitis | total_no_of_deaths | meningitis_change | cumulative_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | ... | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 | 6120.00 |
| mean | 2004.50 | 4.55 | 6.01 | 4.76 | 4.55 | 2.29 | 4.91 | 5.13 | 4.05 | 4.86 | ... | 4.30 | 6.39 | 6.88 | 6.44 | 7.04 | 4.36 | 3.00 | 107820.60 | -21.16 | 1483344.83 |
| std | 8.66 | 2.39 | 2.23 | 1.99 | 2.57 | 2.96 | 1.99 | 2.10 | 2.53 | 2.71 | ... | 2.65 | 2.26 | 2.16 | 2.25 | 2.23 | 2.08 | 2.11 | 128801.21 | 919.88 | 1879915.84 |
| min | 1990.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.69 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | -10728.00 | 13.00 |
| 25% | 1997.00 | 2.77 | 4.51 | 3.33 | 2.30 | 0.00 | 3.56 | 3.71 | 1.79 | 2.48 | ... | 1.79 | 5.17 | 5.67 | 5.04 | 5.65 | 2.89 | 1.10 | 6935.00 | -11.00 | 71995.25 |
| 50% | 2004.50 | 4.70 | 6.50 | 5.11 | 4.79 | 0.00 | 5.18 | 5.58 | 4.01 | 4.92 | ... | 4.53 | 6.87 | 7.43 | 7.10 | 7.69 | 4.84 | 2.77 | 50257.50 | -1.00 | 553431.50 |
| 75% | 2012.00 | 6.74 | 7.81 | 6.41 | 7.06 | 5.98 | 6.55 | 6.78 | 6.60 | 7.54 | ... | 6.95 | 8.14 | 8.57 | 8.17 | 8.71 | 6.11 | 5.08 | 158221.00 | 0.00 | 2266613.50 |
| max | 2019.00 | 7.65 | 8.70 | 7.30 | 7.97 | 6.89 | 7.44 | 7.67 | 7.51 | 8.45 | ... | 7.86 | 9.03 | 9.45 | 9.06 | 9.60 | 7.00 | 5.99 | 385150.00 | 53333.00 | 5558540.88 |
8 rows × 35 columns
For the bonus section, we will use a simple machine learning model to predict future deaths based on the historical data.
# Define features and target
X = df.drop(['total_no_of_deaths'], axis=1)
y = df['total_no_of_deaths']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Linear Regression Model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)
# Evaluate the model
mae_train = mean_absolute_error(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_test = r2_score(y_test, y_pred_test)
# Print evaluation metrics
print("Train Set Evaluation:")
print(f"MAE: {mae_train:.2f}, MSE: {mse_train:.2f}, R²: {r2_train:.2f}")
print("\nTest Set Evaluation:")
print(f"MAE: {mae_test:.2f}, MSE: {mse_test:.2f}, R²: {r2_test:.2f}")
Train Set Evaluation:
MAE: 12220.00, MSE: 368377986.99, R²: 0.98

Test Set Evaluation:
MAE: 13204.36, MSE: 494300444.41, R²: 0.97
Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
# Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
# Make predictions
y_pred_train_rf = rf_model.predict(X_train)
y_pred_test_rf = rf_model.predict(X_test)
# Evaluate the model
mae_train_rf = mean_absolute_error(y_train, y_pred_train_rf)
mse_train_rf = mean_squared_error(y_train, y_pred_train_rf)
r2_train_rf = r2_score(y_train, y_pred_train_rf)
mae_test_rf = mean_absolute_error(y_test, y_pred_test_rf)
mse_test_rf = mean_squared_error(y_test, y_pred_test_rf)
r2_test_rf = r2_score(y_test, y_pred_test_rf)
# Print evaluation metrics
print("Random Forest Train Set Evaluation:")
print(f"MAE: {mae_train_rf:.2f}, MSE: {mse_train_rf:.2f}, R²: {r2_train_rf:.2f}")
print("\nRandom Forest Test Set Evaluation:")
print(f"MAE: {mae_test_rf:.2f}, MSE: {mse_test_rf:.2f}, R²: {r2_test_rf:.2f}")
Random Forest Train Set Evaluation:
MAE: 761.16, MSE: 5878023.11, R²: 1.00

Random Forest Test Set Evaluation:
MAE: 2436.79, MSE: 130860626.72, R²: 0.99
Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create polynomial features (e.g., degree 2 for quadratic regression)
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
# Initialize and train the model
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)
# Predict on the training and test sets
y_poly_train_pred = poly_model.predict(X_poly_train)
y_poly_test_pred = poly_model.predict(X_poly_test)
# Evaluate the model
mse_poly_train = mean_squared_error(y_train, y_poly_train_pred)
r2_poly_train = r2_score(y_train, y_poly_train_pred)
mse_poly_test = mean_squared_error(y_test, y_poly_test_pred)
r2_poly_test = r2_score(y_test, y_poly_test_pred)
# Print evaluation metrics
print("Polynomial Regression Train Set Evaluation:")
print(f"MAE: {mean_absolute_error(y_train, y_poly_train_pred):.2f}, MSE: {mse_poly_train:.2f}, R²: {r2_poly_train:.2f}")
print("\nPolynomial Regression Test Set Evaluation:")
print(f"MAE: {mean_absolute_error(y_test, y_poly_test_pred):.2f}, MSE: {mse_poly_test:.2f}, R²: {r2_poly_test:.2f}")
Polynomial Regression Train Set Evaluation:
MAE: 2752.13, MSE: 26848266.57, R²: 1.00

Polynomial Regression Test Set Evaluation:
MAE: 3968.86, MSE: 113650941.75, R²: 0.99
Time Series Modeling with ARIMA
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Prepare the data for ARIMA model (set 'year' as index)
df.set_index('year', inplace=True)
# Train ARIMA model (order can be adjusted based on data)
arima_model = ARIMA(df['total_no_of_deaths'], order=(1, 1, 1)) # You can tune (p,d,q) parameters
arima_model_fit = arima_model.fit()
# Print model summary
print(arima_model_fit.summary())
# Make predictions
y_pred_arima = arima_model_fit.forecast(steps=len(X_test)) # Forecast future values based on test set size
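# Evaluate ARIMA model
# (note: y_test comes from a random train/test split, so it is not aligned in time with the forecast)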
mse_arima = mean_squared_error(y_test, y_pred_arima)
r2_arima = r2_score(y_test, y_pred_arima)
print("\nARIMA Test Set Evaluation:")
print(f"MSE: {mse_arima:.2f}, R²: {r2_arima:.2f}")
D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting.
SARIMAX Results
==============================================================================
Dep. Variable: total_no_of_deaths No. Observations: 6120
Model: ARIMA(1, 1, 1) Log Likelihood -72382.157
Date: Tue, 17 Sep 2024 AIC 144770.314
Time: 23:26:41 BIC 144790.472
Sample: 0 HQIC 144777.307
- 6120
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.1644 3.681 -0.045 0.964 -7.379 7.050
ma.L1 0.1456 3.682 0.040 0.968 -7.070 7.362
sigma2 1.103e+09 1.99e-07 5.55e+15 0.000 1.1e+09 1.1e+09
===================================================================================
Ljung-Box (L1) (Q): 0.00 Jarque-Bera (JB): 1866416.09
Prob(Q): 0.99 Prob(JB): 0.00
Heteroskedasticity (H): 0.94 Skew: -1.55
Prob(H) (two-sided): 0.18 Kurtosis: 88.50
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 4.88e+30. Standard errors may be unstable.
ARIMA Test Set Evaluation:
MSE: 17168807798.80, R²: -0.01
D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.
D:\SAMAANACONDA\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: FutureWarning: No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.
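The fit above treats all 6,120 country-year rows as a single sequence, and statsmodels warns that the index is unsupported. The evaluation also compares the forecast with a randomly sampled `y_test`, so the reported MSE and R² do not measure forecast accuracy. One way to prepare the input, sketched here with pandas only and on made-up numbers, is to aggregate to one global value per year, give the series a yearly `PeriodIndex`, and hold out the last years chronologically. The `country` column and the two-year hold-out are assumptions for illustration.

```python
import pandas as pd

# Made-up panel: two countries, five years each (column names follow the notebook)
panel = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "year": list(range(2015, 2020)) * 2,
    "total_no_of_deaths": [100, 110, 120, 130, 140, 50, 52, 54, 56, 58],
})

# One observation per year: sum over countries
yearly = panel.groupby("year")["total_no_of_deaths"].sum()

# A yearly PeriodIndex gives the series an explicit annual frequency
yearly.index = pd.PeriodIndex(yearly.index.astype(str), freq="Y")

# Chronological hold-out: the last two years form the test set
train, test = yearly.iloc[:-2], yearly.iloc[-2:]
print(train)
print(test)
```

The `train` series could then be passed to `ARIMA`, with a forecast of `len(test)` steps compared against `test`. With only 30 annual observations in the real data, any such model should be treated as exploratory.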
# Load the dataset
df = pd.read_csv('cause_of_deaths.csv')
cause_of_deaths = ['Meningitis',
'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
# Creating a new column for 'Total_no_of_Deaths' for individual Country and Year
df['Total_no_of_Deaths'] = df[cause_of_deaths].sum(axis=1)
df['year_squared'] = df['Year'] ** 2
year2 = df.sort_values(by='year_squared',ascending=False)[:10][['year_squared','Year']]
year2
| | year_squared | Year |
|---|---|---|
| 6119 | 4076361 | 2019 |
| 1649 | 4076361 | 2019 |
| 5489 | 4076361 | 2019 |
| 4349 | 4076361 | 2019 |
| 2729 | 4076361 | 2019 |
| 2039 | 4076361 | 2019 |
| 4379 | 4076361 | 2019 |
| 659 | 4076361 | 2019 |
| 5459 | 4076361 | 2019 |
| 4409 | 4076361 | 2019 |
# 1. Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['Year']])
# 2. Adding Log Transformation
df['log_total_no_of_deaths'] = np.log1p(df['Total_no_of_Deaths'])
# 3. Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)
# Prepare the features and target variable
X = pd.DataFrame(X_scaled, columns=poly.get_feature_names_out())
y = df['log_total_no_of_deaths']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print the results
print("Linear Regression Model with Feature Engineering:")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
Linear Regression Model with Feature Engineering:
Mean Squared Error: 5.948201145565115
R^2 Score: 0.00042561720103073686
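An R² this close to zero is what one would expect if most of the variation in log deaths lies between countries rather than across years, since `Year` is the only predictor. The sketch below uses synthetic data with a hypothetical `Country` column to show how adding a one-hot encoded country can change the fit; it is not a rerun on the project data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
countries = [f"C{i}" for i in range(10)]
# Per-country baseline level (synthetic)
levels = dict(zip(countries, rng.normal(loc=8.0, scale=2.5, size=10)))

# Synthetic panel: every country observed for 1990-2019
toy = pd.DataFrame(
    [(c, yr) for c in countries for yr in range(1990, 2020)],
    columns=["Country", "Year"],
)
toy["log_deaths"] = (
    toy["Country"].map(levels)
    + 0.01 * (toy["Year"] - 1990)
    + rng.normal(scale=0.05, size=len(toy))
)

X, y = toy[["Country", "Year"]], toy["log_deaths"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Year only
year_only = make_pipeline(StandardScaler(), LinearRegression())
year_only.fit(X_train[["Year"]], y_train)
r2_year = r2_score(y_test, year_only.predict(X_test[["Year"]]))

# Year plus one-hot encoded country (dense output requested via sparse_threshold=0)
pre = ColumnTransformer(
    [
        ("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("year", StandardScaler(), ["Year"]),
    ],
    sparse_threshold=0,
)
with_country = make_pipeline(pre, LinearRegression())
with_country.fit(X_train, y_train)
r2_country = r2_score(y_test, with_country.predict(X_test))

print(f"R^2, year only: {r2_year:.3f}")
print(f"R^2, year + country: {r2_country:.3f}")
```

On the real data the country column is `Country/Territory`; whether a per-country effect is appropriate depends on whether the goal is to forecast known countries or to generalize to new ones.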
# Plotting predictions
plt.scatter(X_test['Year'], np.expm1(y_test), color='blue', label='Actual')  # back-transform so both series are on the original scale
plt.scatter(X_test['Year'], np.expm1(y_pred), color='red', label='Predicted')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
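MSE: The Mean Squared Error (about 5.95) is far smaller in magnitude than in the previous models, but it is computed on the log-transformed target, so it is not directly comparable with the earlier values on raw counts. The R² of about 0.0004 indicates that the year features alone explain almost none of the variance in log total deaths.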
Polynomial Features:
Adding polynomial features allows the model to capture non-linear relationships between the features and the target variable. Here, a quadratic transformation (degree=2) of the Year feature is used.
Log Transformation:
Applying a log transformation to the target variable helps stabilize the variance and handle skewness. The transformation np.log1p(df['Total_no_of_Deaths']) is useful when the data spans a large range or contains outliers.
Standardization:
Standardizing the features (StandardScaler) gives each feature a mean of 0 and a standard deviation of 1. This is especially useful for polynomial features such as Year and its square, whose raw scales differ by several orders of magnitude.
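In the feature-engineering cell earlier in this section, the polynomial expansion and the scaler are fit on the full dataset before the train/test split, so test rows influence the scaling. A sketch of the same three steps wrapped in a pipeline, so that the transformers are learned from the training rows only, is shown below. The data is synthetic and stands in for the Year and log-deaths columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: a quadratic trend in log deaths over 1990-2019
years = rng.integers(1990, 2020, size=300)
t = years - 1990
log_deaths = 10 + 0.05 * t - 0.001 * t ** 2 + rng.normal(scale=0.05, size=300)

X = pd.DataFrame({"Year": years})
y = pd.Series(log_deaths, name="log_total_no_of_deaths")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(X_train, y_train)      # transformers are fit on training rows only
y_pred = pipe.predict(X_test)   # the same fitted transforms are applied to test rows
print(f"R^2 on the log scale: {r2_score(y_test, y_pred):.3f}")
```

A pipeline also keeps the raw `Year` values in `X_test`, so plots against year use the real calendar years rather than standardized values.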
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Initialize and Train Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
# Predict and Evaluate
y_pred_rf = rf_model.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print("Random Forest Regression Model:")
print(f"Mean Squared Error: {mse_rf}")
print(f"R^2 Score: {r2_rf}")
Random Forest Regression Model:
Mean Squared Error: 5.976878671752126
R^2 Score: -0.0043935406984889624
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# Example for Ridge Regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print("Tuned Ridge Regression Model:")
print(f"Mean Squared Error: {mse_best}")
print(f"R^2 Score: {r2_best}")
Best parameters found: {'alpha': 0.1}
Tuned Ridge Regression Model:
Mean Squared Error: 5.944337099032542
R^2 Score: 0.0010749566960110979
importances = rf_model.feature_importances_
feature_names = X.columns
sorted_indices = importances.argsort()[::-1]
# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[sorted_indices], align="center")
plt.xticks(range(X.shape[1]), feature_names[sorted_indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()
# Adding polynomial features and log transformation
df['year_squared'] = df['Year'] ** 2
df['log_total_no_of_deaths'] = np.log1p(df['Total_no_of_Deaths'])
# Prepare features and target variable
X = df[['Year', 'year_squared']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=['Year', 'year_squared'])
y = df['log_total_no_of_deaths']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print the results
print("Linear Regression Model with Feature Engineering:")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")
# Plot predictions
plt.scatter(X_test['Year'], np.expm1(y_test), color='blue', label='Actual')
plt.scatter(X_test['Year'], np.expm1(y_pred), color='red', label='Predicted')
plt.xlabel('Year')
plt.ylabel('Total Number of Deaths')
plt.title('Actual vs Predicted')
plt.legend()
plt.show()
Linear Regression Model with Feature Engineering:
Mean Squared Error: 5.948201145565117
R^2 Score: 0.00042561720103029277
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Grid search for Ridge Regression
param_grid = {'alpha': [0.1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print("Tuned Ridge Regression Model:")
print(f"Mean Squared Error: {mse_best}")
print(f"R^2 Score: {r2_best}")
Best parameters found: {'alpha': 0.1}
Tuned Ridge Regression Model:
Mean Squared Error: 5.944337099032542
R^2 Score: 0.0010749566960110979
Ridge Regression: Regularized model with hyperparameter tuning.
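The grid search above selects the smallest alpha it was offered (0.1), which can mean the best value lies at or below the edge of the grid. A sketch of searching a wider logarithmic grid with `RidgeCV` follows. It uses synthetic data, and the grid bounds are an assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
# Synthetic regression problem with three features
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 3, 13)  # 0.001 ... 1000, evenly spaced on a log scale
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("Selected alpha:", ridge.alpha_)
print("Coefficients:", ridge.coef_)
```

If the selected alpha is again at an end of the grid, the grid can be extended in that direction. When the predictors carry little signal, as with year alone, tuning alpha is unlikely to change R² much.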
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Country/Territory | Afghanistan | Afghanistan | Afghanistan | Afghanistan | Afghanistan |
| Code | AFG | AFG | AFG | AFG | AFG |
| Year | 1990 | 1991 | 1992 | 1993 | 1994 |
| Meningitis | 2159 | 2218 | 2475 | 2812 | 3027 |
| Alzheimer's Disease and Other Dementias | 1116 | 1136 | 1162 | 1187 | 1211 |
| Parkinson's Disease | 371 | 374 | 378 | 384 | 391 |
| Nutritional Deficiencies | 2087 | 2153 | 2441 | 2837 | 3081 |
| Malaria | 93 | 189 | 239 | 108 | 211 |
| Drowning | 1370 | 1391 | 1514 | 1687 | 1809 |
| Interpersonal Violence | 1538 | 2001 | 2299 | 2589 | 2849 |
| Maternal Disorders | 2655 | 2885 | 3315 | 3671 | 3863 |
| HIV/AIDS | 34 | 41 | 48 | 56 | 63 |
| Drug Use Disorders | 93 | 102 | 118 | 132 | 142 |
| Tuberculosis | 4661 | 4743 | 4976 | 5254 | 5470 |
| Cardiovascular Diseases | 44899 | 45492 | 46557 | 47951 | 49308 |
| Lower Respiratory Infections | 23741 | 24504 | 27404 | 31116 | 33390 |
| Neonatal Disorders | 15612 | 17128 | 20060 | 22335 | 23288 |
| Alcohol Use Disorders | 72 | 75 | 80 | 85 | 88 |
| Self-harm | 696 | 751 | 855 | 943 | 993 |
| Exposure to Forces of Nature | 0 | 1347 | 614 | 225 | 160 |
| Diarrheal Diseases | 4235 | 4927 | 6123 | 8174 | 8215 |
| Environmental Heat and Cold Exposure | 175 | 113 | 38 | 41 | 44 |
| Neoplasms | 11580 | 11796 | 12218 | 12634 | 12914 |
| Conflict and Terrorism | 1490 | 3370 | 4344 | 4096 | 8959 |
| Diabetes Mellitus | 2108 | 2120 | 2153 | 2195 | 2231 |
| Chronic Kidney Disease | 3709 | 3724 | 3776 | 3862 | 3932 |
| Poisonings | 338 | 351 | 386 | 425 | 451 |
| Protein-Energy Malnutrition | 2054 | 2119 | 2404 | 2797 | 3038 |
| Road Injuries | 4154 | 4472 | 5106 | 5681 | 6001 |
| Chronic Respiratory Diseases | 5945 | 6050 | 6223 | 6445 | 6664 |
| Cirrhosis and Other Chronic Liver Diseases | 2673 | 2728 | 2830 | 2943 | 3027 |
| Digestive Diseases | 5005 | 5120 | 5335 | 5568 | 5739 |
| Fire, Heat, and Hot Substances | 323 | 332 | 360 | 396 | 420 |
| Acute Hepatitis | 2985 | 3092 | 3325 | 3601 | 3816 |
| year_squared | 3960100 | 3964081 | 3968064 | 3972049 | 3976036 |
| Total_no_of_Deaths | 147971 | 156844 | 169156 | 182230 | 194795 |
| log_total_no_of_deaths | 11.90 | 11.96 | 12.04 | 12.11 | 12.18 |
# Using global death trends for prediction
X = df[['Year']]
y = df['Total_no_of_Deaths']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.ensemble import GradientBoostingRegressor
# Total_no_of_Deaths is a continuous target, so a regressor is used instead of a classifier
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
gb_predictions = gb_model.predict(X_test)
from sklearn.model_selection import cross_val_score
# For regressors, cross_val_score reports R² by default
cv_scores = cross_val_score(gb_model, X, y, cv=5)
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean()}')
from sklearn.metrics import mean_squared_error, r2_score
# A confusion matrix and classification report apply only to classifiers; report regression metrics instead
print(f'Mean Squared Error: {mean_squared_error(y_test, gb_predictions)}')
print(f'R^2 Score: {r2_score(y_test, gb_predictions)}')
from IPython.display import Image, display
# Display an image from a URL
image_url = 'https://th.bing.com/th/id/OIP.MRqC4PFoXLAaZz4nzDowiQHaE8?rs=1&pid=ImgDetMain'
display(Image(url=image_url))
df=pd.read_csv("cause_of_deaths.csv")
# Create a new data frame of Egypt
Egypt_df = df[df['Country/Territory'] == 'Egypt']
Egypt_df.head()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Diabetes Mellitus | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1620 | Egypt | EGY | 2008 | 1138 | 5212 | 1732 | 886 | 0 | 1025 | 503 | ... | 13100 | 15535 | 175 | 736 | 25929 | 18612 | 48127 | 53614 | 1378 | 1689 |
| 1621 | Egypt | EGY | 2009 | 1137 | 5340 | 1796 | 889 | 0 | 1051 | 535 | ... | 13951 | 16199 | 180 | 734 | 26837 | 19088 | 49630 | 55290 | 1415 | 1657 |
| 1622 | Egypt | EGY | 2010 | 1111 | 5464 | 1847 | 874 | 0 | 1070 | 535 | ... | 14715 | 16806 | 183 | 718 | 27409 | 19409 | 50816 | 56588 | 1447 | 1605 |
| 1623 | Egypt | EGY | 2011 | 1115 | 5607 | 1898 | 871 | 0 | 1093 | 562 | ... | 15183 | 17195 | 186 | 713 | 28205 | 19685 | 51945 | 57828 | 1462 | 1568 |
| 1624 | Egypt | EGY | 2014 | 925 | 5999 | 2053 | 842 | 0 | 1051 | 593 | ... | 16886 | 18893 | 183 | 677 | 28405 | 20849 | 55221 | 61427 | 1474 | 1443 |
5 rows × 34 columns
cause_of_deaths = ['Meningitis',
'Alzheimer\'s Disease and Other Dementias', 'Parkinson\'s Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis']
# Creating a new column 'Total_no_of_Deaths' with the total for each year in Egypt
# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
Egypt_df = Egypt_df.copy()
Egypt_df['Total_no_of_Deaths'] = Egypt_df[cause_of_deaths].sum(axis=1)
# Find the total number of each disease in EGYPT
EG_disease = Egypt_df[cause_of_deaths].sum().to_frame().reset_index()
EG_disease.rename(columns = {'index': 'Diseases', 0:'Total_no_of_Deaths'}, inplace = True)
EG_disease
| | Diseases | Total_no_of_Deaths |
|---|---|---|
| 0 | Meningitis | 39101 |
| 1 | Alzheimer's Disease and Other Dementias | 146785 |
| 2 | Parkinson's Disease | 49207 |
| 3 | Nutritional Deficiencies | 30709 |
| 4 | Malaria | 0 |
| 5 | Drowning | 33681 |
| 6 | Interpersonal Violence | 11933 |
| 7 | Maternal Disorders | 34917 |
| 8 | HIV/AIDS | 2784 |
| 9 | Drug Use Disorders | 1366 |
| 10 | Tuberculosis | 35759 |
| 11 | Cardiovascular Diseases | 5995471 |
| 12 | Lower Respiratory Infections | 954868 |
| 13 | Neonatal Disorders | 504806 |
| 14 | Alcohol Use Disorders | 3349 |
| 15 | Self-harm | 70777 |
| 16 | Exposure to Forces of Nature | 1572 |
| 17 | Diarrheal Diseases | 498193 |
| 18 | Environmental Heat and Cold Exposure | 1735 |
| 19 | Neoplasms | 1160639 |
| 20 | Conflict and Terrorism | 7542 |
| 21 | Diabetes Mellitus | 370494 |
| 22 | Chronic Kidney Disease | 445949 |
| 23 | Poisonings | 5649 |
| 24 | Protein-Energy Malnutrition | 26381 |
| 25 | Road Injuries | 796157 |
| 26 | Chronic Respiratory Diseases | 543660 |
| 27 | Cirrhosis and Other Chronic Liver Diseases | 1422257 |
| 28 | Digestive Diseases | 1583081 |
| 29 | Fire, Heat, and Hot Substances | 42655 |
| 30 | Acute Hepatitis | 56882 |
# Find the top 10 cause of deaths in Egypt
Top10_EG_diseases = EG_disease.sort_values(by='Total_no_of_Deaths',ascending = False).head(10)
Top10_EG_diseases
| | Diseases | Total_no_of_Deaths |
|---|---|---|
| 11 | Cardiovascular Diseases | 5995471 |
| 28 | Digestive Diseases | 1583081 |
| 27 | Cirrhosis and Other Chronic Liver Diseases | 1422257 |
| 19 | Neoplasms | 1160639 |
| 12 | Lower Respiratory Infections | 954868 |
| 25 | Road Injuries | 796157 |
| 26 | Chronic Respiratory Diseases | 543660 |
| 13 | Neonatal Disorders | 504806 |
| 17 | Diarrheal Diseases | 498193 |
| 22 | Chronic Kidney Disease | 445949 |
# Create a bar chart of Top 10 cause of deaths in Egypt
plt.figure(figsize=(12,8))
sns.barplot(data = Top10_EG_diseases, x = 'Total_no_of_Deaths', y = 'Diseases', color = 'Blue')
# Add some text for labels, title
plt.xlabel('Total Number of Deaths', fontsize = 15)
plt.ylabel('Diseases', fontsize = 15)
plt.title('Top 10 cause of deaths in EGYPT during 1990-2019', fontsize =15)
# Create Treemap
import plotly.express as px
fig = px.treemap(EG_disease,
                 path = [px.Constant('Total_no_of_Deaths'), 'Diseases'],
                 values = 'Total_no_of_Deaths'
                )
fig.update_traces(textinfo='label+percent parent')
fig.update_layout(title_text='Percentage of cause of deaths in EGYPT during 1990-2019', title_x=0.5, font_size=15)
fig.show()
Egypt_df.columns
Index(['Country/Territory', 'Code', 'Year', 'Meningitis',
'Alzheimer's Disease and Other Dementias', 'Parkinson's Disease',
'Nutritional Deficiencies', 'Malaria', 'Drowning',
'Interpersonal Violence', 'Maternal Disorders', 'HIV/AIDS',
'Drug Use Disorders', 'Tuberculosis', 'Cardiovascular Diseases',
'Lower Respiratory Infections', 'Neonatal Disorders',
'Alcohol Use Disorders', 'Self-harm', 'Exposure to Forces of Nature',
'Diarrheal Diseases', 'Environmental Heat and Cold Exposure',
'Neoplasms', 'Conflict and Terrorism', 'Diabetes Mellitus',
'Chronic Kidney Disease', 'Poisonings', 'Protein-Energy Malnutrition',
'Road Injuries', 'Chronic Respiratory Diseases',
'Cirrhosis and Other Chronic Liver Diseases', 'Digestive Diseases',
'Fire, Heat, and Hot Substances', 'Acute Hepatitis',
'Total_no_of_Deaths'],
dtype='object')
# Find the total number of deaths in Egypt group by year
EG_Deaths_by_year = Egypt_df.groupby('Year')['Total_no_of_Deaths'].sum().reset_index()
EG_Deaths_by_year
| | Year | Total_no_of_Deaths |
|---|---|---|
| 0 | 1990 | 468409 |
| 1 | 1991 | 457493 |
| 2 | 1992 | 447129 |
| 3 | 1993 | 448294 |
| 4 | 1994 | 448394 |
| 5 | 1995 | 438552 |
| 6 | 1996 | 432618 |
| 7 | 1997 | 435796 |
| 8 | 1998 | 436987 |
| 9 | 1999 | 438946 |
| 10 | 2000 | 429426 |
| 11 | 2001 | 448267 |
| 12 | 2002 | 463366 |
| 13 | 2003 | 481773 |
| 14 | 2004 | 483542 |
| 15 | 2005 | 481658 |
| 16 | 2006 | 485863 |
| 17 | 2007 | 488266 |
| 18 | 2008 | 499896 |
| 19 | 2009 | 514724 |
| 20 | 2010 | 523236 |
| 21 | 2011 | 529436 |
| 22 | 2012 | 544796 |
| 23 | 2013 | 540252 |
| 24 | 2014 | 551643 |
| 25 | 2015 | 581847 |
| 26 | 2016 | 587922 |
| 27 | 2017 | 588017 |
| 28 | 2018 | 596617 |
| 29 | 2019 | 605194 |
# Create line chart
plt.figure(figsize=(12,6))
sns.lineplot(data = EG_Deaths_by_year, x='Year', y = 'Total_no_of_Deaths')
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of Total Number of Deaths in EGYPT', fontsize=15)
# Create time series of the top 5 causes of death in Egypt
# Take the names from the ranking computed above (Top10_EG_diseases) so the
# plotted causes match Egypt's actual top 5 over 1990-2019
top5_diseases = Top10_EG_diseases['Diseases'].head(5).tolist()
plt.figure(figsize=(12,8))
for i in top5_diseases:
    sns.lineplot(data = Egypt_df,
                 x = 'Year',
                 y = Egypt_df[i],
                 label = i
                )
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of top 5 cause of deaths in EGYPT', fontsize=15)
The latest year in this dataset is 2019, so the next cells look at the most recent causes of death in Egypt.
Egypt_df.tail()
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1645 | Egypt | EGY | 2012 | 1102 | 5746 | 1989 | 880 | 0 | 1104 | 618 | ... | 18053 | 189 | 717 | 28813 | 20395 | 53786 | 59861 | 1497 | 1542 | 544796 |
| 1646 | Egypt | EGY | 2013 | 1082 | 5816 | 1964 | 844 | 0 | 1095 | 684 | ... | 18037 | 187 | 684 | 28515 | 20325 | 53648 | 59739 | 1481 | 1481 | 540252 |
| 1647 | Egypt | EGY | 2017 | 823 | 6469 | 2296 | 826 | 0 | 1005 | 522 | ... | 20956 | 182 | 650 | 29308 | 21957 | 59798 | 66243 | 1494 | 1397 | 588017 |
| 1648 | Egypt | EGY | 2018 | 790 | 6681 | 2366 | 817 | 0 | 988 | 516 | ... | 21461 | 180 | 639 | 29391 | 22235 | 61156 | 67659 | 1492 | 1373 | 596617 |
| 1649 | Egypt | EGY | 2019 | 764 | 6918 | 2439 | 816 | 0 | 972 | 512 | ... | 21981 | 179 | 634 | 29490 | 22560 | 62635 | 69216 | 1496 | 1356 | 605194 |
5 rows × 35 columns
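Note that `tail()` returns the last rows by position, not by year: the output above runs 2012, 2013, 2017, 2018, 2019, so the rows are not stored in chronological order. A minimal sketch of a more robust way to pick the latest rows, using a small stand-in frame (`demo_df` is a hypothetical name; its values are copied from the output above):

```python
import pandas as pd

# Hypothetical stand-in for Egypt_df, with rows deliberately out of chronological order.
# Totals are the Total_no_of_Deaths values shown in the tail() output above.
demo_df = pd.DataFrame({
    "Year": [2012, 2019, 2013, 2018, 2017],
    "Total_no_of_Deaths": [544796, 605194, 540252, 596617, 588017],
})

# Sort by Year first so that tail() really returns the most recent years.
latest_rows = demo_df.sort_values("Year").tail(3)

# The most recent year can also be read off directly.
latest_year = int(demo_df["Year"].max())

print(latest_rows)
print("Latest year:", latest_year)
```

The same pattern should carry over to `Egypt_df` as `Egypt_df.sort_values('Year').tail()`.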
# Create a new data frame of Egypt year 2019
EG_2019 = Egypt_df[Egypt_df['Year'] == 2019]
EG_2019
| | Country/Territory | Code | Year | Meningitis | Alzheimer's Disease and Other Dementias | Parkinson's Disease | Nutritional Deficiencies | Malaria | Drowning | Interpersonal Violence | ... | Chronic Kidney Disease | Poisonings | Protein-Energy Malnutrition | Road Injuries | Chronic Respiratory Diseases | Cirrhosis and Other Chronic Liver Diseases | Digestive Diseases | Fire, Heat, and Hot Substances | Acute Hepatitis | Total_no_of_Deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1649 | Egypt | EGY | 2019 | 764 | 6918 | 2439 | 816 | 0 | 972 | 512 | ... | 21981 | 179 | 634 | 29490 | 22560 | 62635 | 69216 | 1496 | 1356 | 605194 |
1 rows × 35 columns
# Find the total number of each disease in Egypt in 2019
disease_2019 = EG_2019[cause_of_deaths].sum().to_frame().reset_index()
disease_2019.rename(columns={'index': 'Diseases', 0:'Total_deaths'}, inplace=True)
disease_2019
| | Diseases | Total_deaths |
|---|---|---|
| 0 | Meningitis | 764 |
| 1 | Alzheimer's Disease and Other Dementias | 6918 |
| 2 | Parkinson's Disease | 2439 |
| 3 | Nutritional Deficiencies | 816 |
| 4 | Malaria | 0 |
| 5 | Drowning | 972 |
| 6 | Interpersonal Violence | 512 |
| 7 | Maternal Disorders | 751 |
| 8 | HIV/AIDS | 56 |
| 9 | Drug Use Disorders | 82 |
| 10 | Tuberculosis | 892 |
| 11 | Cardiovascular Diseases | 263873 |
| 12 | Lower Respiratory Infections | 21371 |
| 13 | Neonatal Disorders | 5336 |
| 14 | Alcohol Use Disorders | 150 |
| 15 | Self-harm | 3105 |
| 16 | Exposure to Forces of Nature | 0 |
| 17 | Diarrheal Diseases | 8474 |
| 18 | Environmental Heat and Cold Exposure | 42 |
| 19 | Neoplasms | 57934 |
| 20 | Conflict and Terrorism | 682 |
| 21 | Diabetes Mellitus | 20478 |
| 22 | Chronic Kidney Disease | 21981 |
| 23 | Poisonings | 179 |
| 24 | Protein-Energy Malnutrition | 634 |
| 25 | Road Injuries | 29490 |
| 26 | Chronic Respiratory Diseases | 22560 |
| 27 | Cirrhosis and Other Chronic Liver Diseases | 62635 |
| 28 | Digestive Diseases | 69216 |
| 29 | Fire, Heat, and Hot Substances | 1496 |
| 30 | Acute Hepatitis | 1356 |
# Find Top 5 cause of deaths in EGYPT in 2019
top5_2019 = disease_2019.groupby('Diseases')['Total_deaths'].sum().sort_values(ascending=False).head(5).reset_index()
top5_2019
| | Diseases | Total_deaths |
|---|---|---|
| 0 | Cardiovascular Diseases | 263873 |
| 1 | Digestive Diseases | 69216 |
| 2 | Cirrhosis and Other Chronic Liver Diseases | 62635 |
| 3 | Neoplasms | 57934 |
| 4 | Road Injuries | 29490 |
# Create bar chart of Top 5 Cause of Deaths in EGYPT in 2019
plt.figure(figsize=(12,6))
sns.barplot(data = top5_2019, x = 'Total_deaths', y = 'Diseases', color = 'Blue')
plt.xlabel('Total Number of Deaths', fontsize = 12)
plt.ylabel('Cause of Deaths', fontsize = 12)
plt.title('Top 5 Cause of Deaths in EGYPT in 2019', fontsize =15)
# Try to create pie chart
fig, ax = plt.subplots()
ax.pie(top5_2019['Total_deaths'], labels= top5_2019['Diseases'], autopct='%1.1f%%')
ax.set_title('Top 5 Cause of Deaths in EGYPT in 2019', fontsize =15)
# Create Treemap
fig = px.treemap(disease_2019,
                 path = [px.Constant('Total_deaths'), 'Diseases'],
                 values = 'Total_deaths'
                )
fig.update_traces(textinfo='label+percent parent')
fig.update_layout(title_text='Percentage of Cause of Deaths in EGYPT in 2019', title_x=0.5, font_size=15)
fig.show()
Time series of data not related to disease in EGYPT
I excluded the 'Road Injuries' and 'Self-harm' columns because their values are much larger than those of the other causes. Plotting them on the same axis would stretch the y-axis and make the remaining lines hard to read.
interest_data = ['Drowning',
'Interpersonal Violence',
'Drug Use Disorders',
'Alcohol Use Disorders',
'Environmental Heat and Cold Exposure',
'Fire, Heat, and Hot Substances',
'Poisonings']
plt.figure(figsize=(16,9))
for i in interest_data:
    sns.lineplot(data = Egypt_df,
                 x = 'Year',
                 y = Egypt_df[i],
                 label = i
                )
plt.xlabel('Year',fontsize =12)
plt.ylabel('Total Number of Deaths',fontsize =12)
plt.title('Time Series of Data Not Related to Disease in EGYPT', fontsize=15)
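An alternative to dropping the large-valued columns is a logarithmic y-axis, which keeps series of very different magnitudes readable on one chart. The sketch below is only an illustration of that idea, not part of the original analysis. It uses a small stand-in frame (`sample` is a hypothetical name) whose values are copied from the `Egypt_df.head()` output earlier in this section.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Values for 2008-2011 copied from the Egypt_df.head() output shown earlier.
sample = pd.DataFrame({
    "Year": [2008, 2009, 2010, 2011],
    "Road Injuries": [25929, 26837, 27409, 28205],
    "Drowning": [1025, 1051, 1070, 1093],
    "Poisonings": [175, 180, 183, 186],
})

fig, ax = plt.subplots(figsize=(10, 6))
for col in ["Road Injuries", "Drowning", "Poisonings"]:
    ax.plot(sample["Year"], sample[col], marker="o", label=col)

# A log scale compresses the large values so that all three series stay visible.
ax.set_yscale("log")
ax.set_xticks(sample["Year"])
ax.set_xlabel("Year")
ax.set_ylabel("Number of Deaths (log scale)")
ax.set_title("Non-disease causes of death in Egypt, log-scaled y-axis")
ax.legend()
```

One caveat: a log axis cannot show zero counts, so causes with zeros in some years (for example 'Exposure to Forces of Nature') would need separate handling.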
# Bar Chart: Adding labels on bars for total deaths
plt.figure(figsize=(12,8))
sns.barplot(data=Top10_EG_diseases, x='Total_no_of_Deaths', y='Diseases', color='Blue')
# Add labels on bars
for i in range(Top10_EG_diseases.shape[0]):
    plt.text(Top10_EG_diseases['Total_no_of_Deaths'].values[i], i, f'{Top10_EG_diseases["Total_no_of_Deaths"].values[i]:,}', va='center')
# Add some text for labels and title
plt.xlabel('Total Number of Deaths', fontsize=15)
plt.ylabel('Diseases', fontsize=15)
plt.title('Top 10 Causes of Deaths in Egypt (1990-2019)', fontsize=15)
# Time Series of Top 5 Causes
plt.figure(figsize=(12,8))
for i in top5_diseases:
    sns.lineplot(data=Egypt_df, x='Year', y=Egypt_df[i], label=i, marker='o', markersize=5)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Total Number of Deaths', fontsize=12)
plt.title('Time Series of Top 5 Causes of Deaths in Egypt', fontsize=15)
plt.legend(loc='upper left')
plt.grid(True)
# Cause of deaths in Egypt in 2019
Egypt_2019 = Egypt_df[Egypt_df['Year'] == 2019]
latest_deaths = Egypt_2019[cause_of_deaths].sum().to_frame().reset_index()
latest_deaths.rename(columns={'index': 'Diseases', 0: 'Total_no_of_Deaths'}, inplace=True)
latest_deaths.sort_values(by='Total_no_of_Deaths', ascending=False, inplace=True)
latest_deaths.head(10) # Display the top 10 causes of death in 2019
| | Diseases | Total_no_of_Deaths |
|---|---|---|
| 11 | Cardiovascular Diseases | 263873 |
| 28 | Digestive Diseases | 69216 |
| 27 | Cirrhosis and Other Chronic Liver Diseases | 62635 |
| 19 | Neoplasms | 57934 |
| 25 | Road Injuries | 29490 |
| 26 | Chronic Respiratory Diseases | 22560 |
| 22 | Chronic Kidney Disease | 21981 |
| 12 | Lower Respiratory Infections | 21371 |
| 21 | Diabetes Mellitus | 20478 |
| 17 | Diarrheal Diseases | 8474 |
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Example of linear regression
X = Egypt_df[['Cardiovascular Diseases', 'Neoplasms', 'Chronic Respiratory Diseases', 'Alzheimer\'s Disease and Other Dementias', 'Digestive Diseases']]
y = Egypt_df['Total_no_of_Deaths']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
Mean Squared Error: 13111629.863343896
R2 Score: 0.995502477370117
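The mean squared error is in squared deaths, which is hard to interpret directly. Taking its square root gives the RMSE in the same units as the target. The sketch below does that arithmetic with the value printed above and, as an assumed point of comparison, Egypt's 2019 total from the yearly table (605194). It works out to an RMSE of roughly 3,600 deaths.

```python
import math

mse = 13111629.863343896   # Mean Squared Error printed above
total_2019 = 605194        # Egypt's Total_no_of_Deaths for 2019, from the yearly table

# RMSE is expressed in deaths per year, the same units as the target.
rmse = math.sqrt(mse)

# Relative size of the typical error compared with the 2019 total.
relative_error = rmse / total_2019

print(f"RMSE: {rmse:,.0f} deaths")
print(f"RMSE as a share of the 2019 total: {relative_error:.2%}")
```

Keep in mind that the five features are themselves components of `Total_no_of_Deaths`, and that the Egypt data has only 30 yearly rows, so the test split holds only a handful of points. The high R2 score should be read with both facts in mind.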